nips nips2005 nips2005-33 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Zoubin Ghahramani, Katherine A. Heller
Abstract: Inspired by “Google™ Sets”, we consider the problem of retrieving items from a concept or cluster, given a query consisting of a few items from that cluster. We formulate this as a Bayesian inference problem and describe a very simple algorithm for solving it. Our algorithm uses a model-based concept of a cluster and ranks items using a score which evaluates the marginal probability that each item belongs to a cluster containing the query items. For exponential family models with conjugate priors this marginal probability is a simple function of sufficient statistics. We focus on sparse binary data and show that our score can be evaluated exactly using a single sparse matrix multiplication, making it possible to apply our algorithm to very large datasets. We evaluate our algorithm on three datasets: retrieving movies from EachMovie, finding completions of author sets from the NIPS dataset, and finding completions of sets of words appearing in the Grolier encyclopedia. We compare to Google™ Sets and show that Bayesian Sets gives very reasonable set completions. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Inspired by “Google™ Sets”, we consider the problem of retrieving items from a concept or cluster, given a query consisting of a few items from that cluster. [sent-7, score-0.928]
2 Our algorithm uses a model-based concept of a cluster and ranks items using a score which evaluates the marginal probability that each item belongs to a cluster containing the query items. [sent-9, score-1.104]
3 For exponential family models with conjugate priors this marginal probability is a simple function of sufficient statistics. [sent-10, score-0.133]
4 We focus on sparse binary data and show that our score can be evaluated exactly using a single sparse matrix multiplication, making it possible to apply our algorithm to very large datasets. [sent-11, score-0.269]
5 We evaluate our algorithm on three datasets: retrieving movies from EachMovie, finding completions of author sets from the NIPS dataset, and finding completions of sets of words appearing in the Grolier encyclopedia. [sent-12, score-0.714]
6 Other than being associated with two different views on the origin of man, they also have colleges at Cambridge University named after them. [sent-15, score-0.057]
7 If these two names are entered as a query into Google™ Sets (http://labs.google.com/sets) [sent-16, score-0.403]
8 it returns a list of other colleges at Cambridge. [sent-18, score-0.057]
9 Depending on the application, the set D may consist of web pages, movies, people, words, proteins, images, or any other object we may wish to form queries on. [sent-21, score-0.122]
10 The user provides a query in the form of a very small subset of items Dc ⊂ D. [sent-22, score-0.61]
11 The assumption is that the elements in Dc are examples of some concept / class / cluster in the data. [sent-23, score-0.173]
12 The algorithm then has to provide a completion to the set Dc —that is, some set D′c ⊆ D which presumably includes all the elements in Dc and other elements in D which are also in this concept / class / cluster. [sent-24, score-0.115]
13 Google™ Sets is a large-scale clustering algorithm that uses many millions of data instances extracted from web data (Simon Tong, personal communication). [sent-26, score-0.05]
14 First, the query can be interpreted as elements of some unknown cluster, and the output of the algorithm is the completion of that cluster. [sent-30, score-0.442]
15 Whereas most clustering algorithms are completely unsupervised, here the query provides supervised hints or constraints as to the membership of a particular cluster. [sent-31, score-0.433]
16 We call this view clustering on demand, since it involves forming a cluster once some elements of that cluster have been revealed. [sent-32, score-0.233]
17 An important advantage of this approach over traditional clustering is that the few elements in the query can give useful information as to the features which are relevant for forming the cluster. [sent-33, score-0.442]
18 For example, the query “Bush”, “Nixon”, “Reagan” suggests that the features republican and US President are relevant to the cluster, while the query “Bush”, “Putin”, “Blair” suggests that current and world leader are relevant. [sent-34, score-0.806]
19 Given the huge number of features in many real world data sets, such hints as to feature relevance can produce much more sensible clusters. [sent-35, score-0.135]
20 As in other retrieval problems, the output should be relevant to the query, and it makes sense to limit the output to the top few items ranked by relevance to the query. [sent-37, score-0.331]
21 In our experiments, we take this approach and report items ranked by relevance. [sent-38, score-0.242]
22 Our relevance criterion is closely related to a Bayesian framework for understanding patterns of generalization in human cognition [5]. [sent-39, score-0.041]
23 [Section 2: Bayesian Sets] Let D be a data set of items, and x ∈ D be an item from this set. [sent-40, score-0.084]
24 Assume the user provides a query set Dc which is a small subset of D. [sent-41, score-0.403]
25 Our goal is to rank the elements of D by how well they would “fit into” a set which includes Dc . [sent-42, score-0.039]
26 Intuitively, the task is clear: if the set D is the set of all movies, and the query set consists of two animated Disney movies, we expect other animated Disney movies to be ranked highly. [sent-43, score-0.708]
27 We use a model-based probabilistic criterion to measure how well items fit into Dc . [sent-44, score-0.207]
28 Ranking items simply by this probability is not sensible since some items may be more probable than others, regardless of Dc . [sent-47, score-0.478]
29 For example, under most sensible models, the probability of a string decreases with the number of characters, the probability of an image decreases with the number of pixels, and the probability of any continuous variable decreases with the precision to which it is measured. [sent-48, score-0.064]
30 We want to remove these effects, so we compute the ratio: score(x) = p(x|Dc) / p(x) (1), where the denominator is the prior probability of x and under most sensible models will scale exactly correctly with number of pixels, characters, discretization level, etc. [sent-49, score-0.064]
31 Using Bayes rule, this score can be re-written as: score(x) = p(x, Dc) / (p(x) p(Dc)) (2), which can be interpreted as the ratio of the joint probability of observing x and Dc, to the probability of independently observing x and Dc. [sent-50, score-0.179]
32 Finally, up to a multiplicative constant independent of x, the score can be written as: score(x) = p(Dc |x), which is the probability of observing the query set given x (i. [sent-52, score-0.582]
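The ratio in equations (1)-(2) can be written directly in terms of log marginal likelihoods. The following minimal sketch (not from the paper) illustrates this; `log_marginal` is an assumed user-supplied helper that returns the log marginal likelihood of a list of items under the chosen model and prior.

```python
# Hypothetical sketch of the generic Bayesian Sets score (equations 1-2).
# `log_marginal` is an assumed helper, not part of any library: it returns
# log p(items), the log marginal likelihood of the items under the model.
def log_score(x, query, log_marginal):
    # log score(x) = log p(x, Dc) - log p(x) - log p(Dc)
    return (log_marginal(list(query) + [x])
            - log_marginal([x])
            - log_marginal(list(query)))
```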
33 A natural model-based way of defining a cluster is to assume that [Figure 1 caption: Our Bayesian score compares the hypotheses that the data was generated by each of the above graphical models.] [sent-56, score-0.276]
34 the data points in the cluster all come independently and identically distributed from some simple parameterized statistical model. [sent-57, score-0.097]
35 In fact, for the model we consider in section 3, computing all the scores can be reduced to a single sparse matrix multiplication. [sent-62, score-0.045]
36 Although it clearly makes sense to put some thought into choosing sensible models p(x|θ) and priors p(θ), we will show in section 5 that even with very simple models and almost no tuning of the prior one can get very competitive retrieval results. [sent-64, score-0.102]
37 In practice, we use a simple empirical heuristic which sets the prior to be vague but centered on the mean of the data in D. [sent-65, score-0.135]
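One plausible reading of this heuristic, sketched below for binary data: centre a vague per-feature Beta prior on the empirical feature means. The scale factor c and the parameterisation alpha_j = c*m_j, beta_j = c*(1 - m_j) are assumptions rather than details stated in the text.

```python
import numpy as np

def empirical_beta_prior(X, c=2.0):
    # X: (n_items, n_features) binary data matrix (dense ndarray or scipy sparse).
    # The choice c = 2 and the exact centring rule are assumptions.
    m = np.asarray(X.mean(axis=0)).ravel()   # per-feature empirical mean
    m = np.clip(m, 1e-6, 1.0 - 1e-6)         # guard against all-zero / all-one columns
    return c * m, c * (1.0 - m)              # (alpha, beta)
```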
38 [Section 3: Sparse Binary Data] We now derive in more detail the application of the Bayesian Sets algorithm to sparse binary data. [sent-66, score-0.045]
39 This type of data is a very natural representation for the large datasets we used in our evaluations (section 5). [sent-67, score-0.037]
40 Applications of Bayesian Sets to other forms of data (real-valued, discrete, ordinal, strings) are also possible, and especially practical if the statistical model is a member of the exponential family (section 4). [sent-68, score-0.067]
41 Assume each item xi ∈ Dc is a binary vector xi = (xi1 , . . . , xiJ ). [sent-69, score-0.144]
42 For a query Dc = {xi} consisting of N vectors it is easy to show that: p(Dc | α, β) = ∏j [Γ(αj + βj) Γ(α̃j) Γ(β̃j)] / [Γ(αj) Γ(βj) Γ(α̃j + β̃j)] (9), where α̃j = αj + Σ_{i=1..N} xij and β̃j = βj + N − Σ_{i=1..N} xij. [sent-73, score-0.558]
43 Each query Dc corresponds to computing the vector q and scalar c. [sent-83, score-0.403]
44 This can also be done efficiently if the query is also sparse, since most elements of q will equal log βj − log(βj + N ) which is independent of the query. [sent-84, score-0.479]
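Putting the Beta-Bernoulli pieces together, the sketch below ranks every item with a single sparse matrix-vector product. The closed form log score(x) = c + q·x is reconstructed from equation (9) and the surrounding description, so treat it as an illustration rather than the authors' exact implementation.

```python
import numpy as np

def bayesian_sets_rank(X, query_idx, alpha, beta):
    # X: (n_items, n_features) sparse binary matrix (e.g. scipy.sparse.csr_matrix);
    # query_idx: row indices of the query set Dc; alpha, beta: Beta prior vectors.
    Xq = X[query_idx]
    N = Xq.shape[0]
    s = np.asarray(Xq.sum(axis=0)).ravel()    # sufficient statistics of the query
    alpha_t = alpha + s                       # alpha-tilde (posterior counts)
    beta_t = beta + N - s                     # beta-tilde
    # Reconstructed closed form: log score(x) = c + sum_j q_j x_j
    c = np.sum(np.log(alpha + beta) - np.log(alpha + beta + N)
               + np.log(beta_t) - np.log(beta))
    q = (np.log(alpha_t) - np.log(alpha)
         - np.log(beta_t) + np.log(beta))
    log_scores = c + X @ q                    # one sparse matrix-vector product
    return np.argsort(-log_scores), log_scores
```

For example, `order, _ = bayesian_sets_rank(X, [i1, i2], *empirical_beta_prior(X))` would return all items ordered by how well they complete the two-item query, with the top entries forming the set completion.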
45 [Section 4: Exponential Families] We generalize the above result to models in the exponential family. [sent-85, score-0.03]
46 The conjugate prior is p(θ|η, ν) = h(η, ν) g(θ)^η exp{θ⊤ν}, where η and ν are hyperparameters, and h normalizes the distribution. [sent-87, score-0.028]
47 First of all, the score only depends on the size of the query (N), the sufficient statistics computed from each candidate, and those computed from the whole query. [sent-89, score-0.582]
48 Second, whether the score is a linear operation on U depends on whether log h is linear in the second argument. [sent-91, score-0.216]
49 This is the case for the Bernoulli distribution, but not for all exponential family distributions. [sent-92, score-0.067]
50 However, for many distributions, such as diagonal covariance Gaussians, even though the score is nonlinear in U, it can be computed by applying the nonlinearity elementwise to U. [sent-93, score-0.179]
51 For sparse matrices, the score can therefore still be computed in time linear in the number of non-zero elements of U. [sent-94, score-0.263]
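For concreteness, a hedged sketch of the exponential-family version of the score, written in terms of the conjugate prior's normaliser h(η, ν). The closed form below is reconstructed from the marginal-likelihood ratio p(x|Dc)/p(x), not quoted from the paper; `log_h` (the log-normaliser) and `u` (the sufficient-statistic map) are assumed user-supplied functions.

```python
def log_score_expfam(x, query, u, log_h, eta, nu):
    # Reconstructed score for a conjugate exponential-family model:
    # score(x) = p(x | Dc) / p(x)
    #          = h(eta+N, nu+s) h(eta+1, nu+u(x)) / [ h(eta+N+1, nu+u(x)+s) h(eta, nu) ],
    # where s = sum_i u(x_i) are the query's sufficient statistics.
    N = len(query)
    s = sum(u(xi) for xi in query)
    return (log_h(eta + N, nu + s) + log_h(eta + 1, nu + u(x))
            - log_h(eta + N + 1, nu + u(x) + s) - log_h(eta, nu))
```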
52 The Groliers dataset is 30991 articles by 15276 words, where the entries are the number of times each word appears in each document. [sent-96, score-0.16]
53 We preprocess (binarize) the data by column normalizing each word, and then thresholding so that an (article, word) entry is 1 if that word has a frequency of more than twice the article mean. [sent-97, score-0.056]
54 The analogous priors are used for both other datasets. [sent-100, score-0.038]
55 The EachMovie dataset was preprocessed, first by removing movies rated by fewer than 15 people, and people who rated fewer than 200 movies. [sent-101, score-0.394]
56 Then the dataset was binarized so that a (person, movie) entry had value 1 if the person gave the movie a rating above 3 stars (from a possible 0-5 stars). [sent-102, score-0.25]
57 The data was then column normalized to account for overall movie popularity. [sent-103, score-0.126]
58 The size of the dataset after preprocessing was 1813 people by 1532 movies. [sent-104, score-0.086]
59 Finally, the NIPS author dataset (13649 words by 2037 authors) was preprocessed very similarly to the Grolier dataset. [sent-105, score-0.188]
60 It was binarized by column normalizing each author, and then thresholding so that a (word,author) entry is 1 if the author uses that word more frequently than twice the word mean across all authors. [sent-106, score-0.19]
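A rough sketch of the binarisation described for the count datasets (Grolier articles-by-words, NIPS words-by-authors): column-normalise, then threshold at twice the mean. The exact normalisation and thresholding axes are slightly ambiguous in the text, so the details below are assumptions.

```python
import numpy as np

def binarize_counts(counts, factor=2.0):
    # counts: (n_rows, n_cols) dense array of raw word counts.
    # Column-normalise, then set an entry to 1 if it exceeds `factor` times the
    # mean of its (normalised) column; the per-column threshold is an assumption.
    col_sums = np.maximum(counts.sum(axis=0, keepdims=True), 1.0)
    normalized = counts / col_sums
    threshold = factor * normalized.mean(axis=0, keepdims=True)
    return (normalized > threshold).astype(np.uint8)
```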
61 The results of our experiments, and comparisons with Google Sets for word and movie queries, are given in Tables 2 and 3. [sent-107, score-0.254]
62 Unfortunately, NIPS authors have not yet achieved the kind of popularity on the web necessary for Google Sets to work effectively. [sent-108, score-0.09]
63 Instead we list the top words associated with the cluster of authors given by our algorithm (table 4). [sent-109, score-0.236]
64 The running times of our algorithm on all three datasets are given in table 1. [sent-110, score-0.07]
65 Our algorithm is very fast both at pre-processing the data, and answering queries (about 1 sec per query). [sent-112, score-0.072]
66 Table 1: For each dataset we give the size of that dataset along with the time taken to do the (one-time) preprocessing and the time taken to make a query (both in seconds). [sent-119, score-0.515]
67 The top few are shown for each query and each algorithm. [sent-121, score-0.451]
68 One person’s idea of a good query cluster may differ drastically from another person’s. [sent-124, score-0.5]
69 Moreover, Google Sets relies on vast amounts of web data, which we do not have. [sent-127, score-0.05]
70 We found that Google Sets performed very well when the query consisted of items which can be found listed on the web (e. [sent-129, score-0.66]
71 “soldier” and “warrior”, see Table 2) our algorithm returned more sensible completions. [sent-134, score-0.097]
72 The top query in Table 3 consists of two classic romantic movies, [footnote 3: In fact, one of the example queries on the Google Sets website is a query of movie titles.] [sent-136, score-1.085]
73 The top 10 are shown for each query and each algorithm. [sent-138, score-0.451]
74 and while most of the movies returned by Bayesian Sets are also classic romances, hardly any of the movies returned by Google Sets are romances, and it would be difficult to call “Ernest Saves Christmas” either a romance or a classic. [sent-140, score-0.486]
75 Both “Cutthroat Island” and “Last Action Hero” are action movie flops, as are many of the movies given by our algorithm for that query. [sent-141, score-0.367]
76 All the Bayes Sets movies associated with the query “Mary Poppins” and “Toy Story” are children’s movies, while 5 of Google Sets’ movies are not. [sent-142, score-0.823]
77 “But I’m a Cheerleader”, while appearing to be a children’s movie, is actually an R-rated movie involving lesbian and gay teens. [sent-143, score-0.175]
78 [Table 4 residue: author MCALLESTER; top words: DECISION, REINFORCEMENT, ACTIONS, REWARDS, REWARD, START, RETURN, RECEIVED, MDP, SELECTS] Table 4: NIPS authors found by Bayesian Sets based on the given queries. [sent-180, score-0.04]
79 The top 10 are shown for each query along with the top 10 words associated with that cluster of authors. [sent-181, score-0.647]
80 The NIPS author dataset is rather small, and co-authors of NIPS papers appear very similar to each other. [sent-183, score-0.104]
81 Therefore, many of the authors found by our algorithm are co-authors of a NIPS paper with one or more of the query authors. [sent-184, score-0.443]
82 As part of the evaluation of our algorithm, we showed 30 naïve subjects the unlabeled results of Bayesian Sets and Google Sets for the queries shown from the EachMovie and Grolier Encyclopedia datasets, and asked them to choose which they preferred. [sent-186, score-0.072]
83 Descriptions and results can be found in supplemental material on the authors’ websites. [sent-202, score-0.04]
84 [Section 6: Conclusions] We have described an algorithm which takes a query consisting of a small set of items, and returns additional items which belong in this set. [sent-203, score-0.639]
85 Our algorithm computes a score for each item by comparing the posterior probability of that item given the set, to the prior probability of that item. [sent-204, score-0.347]
86 For exponential family models with conjugate priors, our score can be computed exactly and efficiently. [sent-206, score-0.274]
87 In fact, we show that for sparse binary data, scoring all items in a large data set can be accomplished using a single sparse matrix-vector multiplication. [sent-207, score-0.297]
88 For example, a sparse data set with over 2 million nonzero entries (Grolier) can be queried in just over 1 second. [sent-209, score-0.045]
89 The problem of retrieving sets of items is clearly relevant to many application domains. [sent-214, score-0.387]
90 We believe that with even larger datasets the Bayesian Sets algorithm will be a very useful tool for many application areas. [sent-220, score-0.037]
91 (2002) Probabilistic relevance models based on document and query generation. [sent-228, score-0.444]
wordName wordTfidf (topN-words)
[('dc', 0.41), ('query', 0.403), ('google', 0.396), ('ets', 0.245), ('movies', 0.21), ('items', 0.207), ('score', 0.179), ('sets', 0.135), ('movie', 0.126), ('eachmovie', 0.113), ('oogle', 0.113), ('bayesian', 0.104), ('cluster', 0.097), ('uery', 0.094), ('warrior', 0.094), ('item', 0.084), ('grolier', 0.076), ('story', 0.075), ('queries', 0.072), ('encyclopedia', 0.066), ('sensible', 0.064), ('xij', 0.063), ('casablanca', 0.057), ('colleges', 0.057), ('cutthroat', 0.057), ('embers', 0.057), ('groliers', 0.057), ('mary', 0.057), ('poppins', 0.057), ('soldier', 0.057), ('dataset', 0.056), ('word', 0.056), ('bayes', 0.055), ('words', 0.051), ('web', 0.05), ('nips', 0.05), ('hero', 0.049), ('rated', 0.049), ('top', 0.048), ('articles', 0.048), ('author', 0.048), ('sparse', 0.045), ('completions', 0.045), ('wind', 0.045), ('retrieving', 0.045), ('toy', 0.043), ('island', 0.042), ('relevance', 0.041), ('authors', 0.04), ('tong', 0.04), ('elements', 0.039), ('person', 0.038), ('addams', 0.038), ('ary', 0.038), ('aul', 0.038), ('cheerleader', 0.038), ('cholkopf', 0.038), ('disney', 0.038), ('ernest', 0.038), ('fish', 0.038), ('gone', 0.038), ('ime', 0.038), ('ish', 0.038), ('mola', 0.038), ('nimal', 0.038), ('oppins', 0.038), ('romances', 0.038), ('sland', 0.038), ('utthroat', 0.038), ('utton', 0.038), ('priors', 0.038), ('log', 0.037), ('datasets', 0.037), ('family', 0.037), ('concept', 0.037), ('ranked', 0.035), ('table', 0.033), ('returned', 0.033), ('ero', 0.033), ('bush', 0.033), ('christmas', 0.033), ('conferences', 0.033), ('food', 0.033), ('night', 0.033), ('plant', 0.033), ('preprocessed', 0.033), ('zg', 0.033), ('action', 0.031), ('multiplication', 0.03), ('water', 0.03), ('exponential', 0.03), ('hints', 0.03), ('animated', 0.03), ('binarized', 0.03), ('saves', 0.03), ('people', 0.03), ('xi', 0.03), ('consisting', 0.029), ('bernoulli', 0.029), ('conjugate', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999899 33 nips-2005-Bayesian Sets
Author: Zoubin Ghahramani, Katherine A. Heller
Abstract: Inspired by “Google™ Sets”, we consider the problem of retrieving items from a concept or cluster, given a query consisting of a few items from that cluster. We formulate this as a Bayesian inference problem and describe a very simple algorithm for solving it. Our algorithm uses a model-based concept of a cluster and ranks items using a score which evaluates the marginal probability that each item belongs to a cluster containing the query items. For exponential family models with conjugate priors this marginal probability is a simple function of sufficient statistics. We focus on sparse binary data and show that our score can be evaluated exactly using a single sparse matrix multiplication, making it possible to apply our algorithm to very large datasets. We evaluate our algorithm on three datasets: retrieving movies from EachMovie, finding completions of author sets from the NIPS dataset, and finding completions of sets of words appearing in the Grolier encyclopedia. We compare to Google™ Sets and show that Bayesian Sets gives very reasonable set completions. 1
2 0.11819521 160 nips-2005-Query by Committee Made Real
Author: Ran Gilad-bachrach, Amir Navot, Naftali Tishby
Abstract: Training a learning algorithm is a costly task. A major goal of active learning is to reduce this cost. In this paper we introduce a new algorithm, KQBC, which is capable of actively learning large scale problems by using selective sampling. The algorithm overcomes the costly sampling step of the well known Query By Committee (QBC) algorithm by projecting onto a low dimensional space. KQBC also enables the use of kernels, providing a simple way of extending QBC to the non-linear scenario. Sampling the low dimension space is done using the hit and run random walk. We demonstrate the success of this novel algorithm by applying it to both artificial and a real world problems.
3 0.084299944 104 nips-2005-Laplacian Score for Feature Selection
Author: Xiaofei He, Deng Cai, Partha Niyogi
Abstract: In supervised learning scenarios, feature selection has been studied widely in the literature. Selecting features in unsupervised learning scenarios is a much harder problem, due to the absence of class labels that would guide the search for relevant information. And, almost all of previous unsupervised feature selection methods are “wrapper” techniques that require a learning algorithm to evaluate the candidate feature subsets. In this paper, we propose a “filter” method for feature selection which is independent of any learning algorithm. Our method can be performed in either supervised or unsupervised fashion. The proposed method is based on the observation that, in many real world classification problems, data from the same class are often close to each other. The importance of a feature is evaluated by its power of locality preserving, or, Laplacian Score. We compare our method with data variance (unsupervised) and Fisher score (supervised) on two data sets. Experimental results demonstrate the effectiveness and efficiency of our algorithm. 1
4 0.080434062 69 nips-2005-Fast Gaussian Process Regression using KD-Trees
Author: Yirong Shen, Matthias Seeger, Andrew Y. Ng
Abstract: The computation required for Gaussian process regression with n training examples is about O(n3 ) during training and O(n) for each prediction. This makes Gaussian process regression too slow for large datasets. In this paper, we present a fast approximation method, based on kd-trees, that significantly reduces both the prediction and the training times of Gaussian process regression.
5 0.070533633 70 nips-2005-Fast Information Value for Graphical Models
Author: Brigham S. Anderson, Andrew W. Moore
Abstract: Calculations that quantify the dependencies between variables are vital to many operations with graphical models, e.g., active learning and sensitivity analysis. Previously, pairwise information gain calculation has involved a cost quadratic in network size. In this work, we show how to perform a similar computation with cost linear in network size. The loss function that allows this is of a form amenable to computation by dynamic programming. The message-passing algorithm that results is described and empirical results demonstrate large speedups without decrease in accuracy. In the cost-sensitive domains examined, superior accuracy is achieved.
6 0.070313811 41 nips-2005-Coarse sample complexity bounds for active learning
7 0.068243369 84 nips-2005-Generalization in Clustering with Unobserved Features
8 0.061497077 18 nips-2005-Active Learning For Identifying Function Threshold Boundaries
9 0.054856922 102 nips-2005-Kernelized Infomax Clustering
10 0.054761156 201 nips-2005-Variational Bayesian Stochastic Complexity of Mixture Models
11 0.052791711 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis
12 0.0508219 171 nips-2005-Searching for Character Models
13 0.049104672 48 nips-2005-Context as Filtering
14 0.048466302 100 nips-2005-Interpolating between types and tokens by estimating power-law generators
15 0.046591315 4 nips-2005-A Bayesian Spatial Scan Statistic
16 0.046070337 59 nips-2005-Dual-Tree Fast Gauss Transforms
17 0.04460407 52 nips-2005-Correlated Topic Models
18 0.04440799 114 nips-2005-Learning Rankings via Convex Hull Separation
19 0.043247063 202 nips-2005-Variational EM Algorithms for Non-Gaussian Latent Variable Models
20 0.043026954 140 nips-2005-Nonparametric inference of prior probabilities from Bayes-optimal behavior
topicId topicWeight
[(0, 0.141), (1, 0.043), (2, -0.01), (3, 0.032), (4, -0.001), (5, -0.039), (6, 0.007), (7, 0.073), (8, -0.085), (9, 0.081), (10, -0.054), (11, -0.017), (12, 0.141), (13, -0.045), (14, -0.057), (15, -0.04), (16, 0.012), (17, -0.023), (18, -0.078), (19, -0.027), (20, 0.014), (21, 0.003), (22, 0.092), (23, -0.073), (24, -0.006), (25, -0.002), (26, -0.047), (27, 0.053), (28, -0.05), (29, -0.066), (30, -0.088), (31, 0.092), (32, 0.008), (33, 0.07), (34, -0.17), (35, -0.057), (36, -0.102), (37, -0.26), (38, 0.066), (39, -0.013), (40, 0.047), (41, 0.036), (42, -0.06), (43, -0.003), (44, -0.152), (45, -0.001), (46, -0.032), (47, 0.07), (48, 0.075), (49, -0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.94123191 33 nips-2005-Bayesian Sets
Author: Zoubin Ghahramani, Katherine A. Heller
Abstract: Inspired by “Google™ Sets”, we consider the problem of retrieving items from a concept or cluster, given a query consisting of a few items from that cluster. We formulate this as a Bayesian inference problem and describe a very simple algorithm for solving it. Our algorithm uses a model-based concept of a cluster and ranks items using a score which evaluates the marginal probability that each item belongs to a cluster containing the query items. For exponential family models with conjugate priors this marginal probability is a simple function of sufficient statistics. We focus on sparse binary data and show that our score can be evaluated exactly using a single sparse matrix multiplication, making it possible to apply our algorithm to very large datasets. We evaluate our algorithm on three datasets: retrieving movies from EachMovie, finding completions of author sets from the NIPS dataset, and finding completions of sets of words appearing in the Grolier encyclopedia. We compare to Google™ Sets and show that Bayesian Sets gives very reasonable set completions. 1
2 0.56824118 160 nips-2005-Query by Committee Made Real
Author: Ran Gilad-bachrach, Amir Navot, Naftali Tishby
Abstract: Training a learning algorithm is a costly task. A major goal of active learning is to reduce this cost. In this paper we introduce a new algorithm, KQBC, which is capable of actively learning large scale problems by using selective sampling. The algorithm overcomes the costly sampling step of the well known Query By Committee (QBC) algorithm by projecting onto a low dimensional space. KQBC also enables the use of kernels, providing a simple way of extending QBC to the non-linear scenario. Sampling the low dimension space is done using the hit and run random walk. We demonstrate the success of this novel algorithm by applying it to both artificial and a real world problems.
3 0.53545409 104 nips-2005-Laplacian Score for Feature Selection
Author: Xiaofei He, Deng Cai, Partha Niyogi
Abstract: In supervised learning scenarios, feature selection has been studied widely in the literature. Selecting features in unsupervised learning scenarios is a much harder problem, due to the absence of class labels that would guide the search for relevant information. And, almost all of previous unsupervised feature selection methods are “wrapper” techniques that require a learning algorithm to evaluate the candidate feature subsets. In this paper, we propose a “filter” method for feature selection which is independent of any learning algorithm. Our method can be performed in either supervised or unsupervised fashion. The proposed method is based on the observation that, in many real world classification problems, data from the same class are often close to each other. The importance of a feature is evaluated by its power of locality preserving, or, Laplacian Score. We compare our method with data variance (unsupervised) and Fisher score (supervised) on two data sets. Experimental results demonstrate the effectiveness and efficiency of our algorithm. 1
4 0.50703865 84 nips-2005-Generalization in Clustering with Unobserved Features
Author: Eyal Krupka, Naftali Tishby
Abstract: We argue that when objects are characterized by many attributes, clustering them on the basis of a relatively small random subset of these attributes can capture information on the unobserved attributes as well. Moreover, we show that under mild technical conditions, clustering the objects on the basis of such a random subset performs almost as well as clustering with the full attribute set. We prove a finite sample generalization theorems for this novel learning scheme that extends analogous results from the supervised learning setting. The scheme is demonstrated for collaborative filtering of users with movies rating as attributes. 1
5 0.4645108 59 nips-2005-Dual-Tree Fast Gauss Transforms
Author: Dongryeol Lee, Andrew W. Moore, Alexander G. Gray
Abstract: In previous work we presented an efficient approach to computing kernel summations which arise in many machine learning methods such as kernel density estimation. This approach, dual-tree recursion with finite-difference approximation, generalized existing methods for similar problems arising in computational physics in two ways appropriate for statistical problems: toward distribution sensitivity and general dimension, partly by avoiding series expansions. While this proved to be the fastest practical method for multivariate kernel density estimation at the optimal bandwidth, it is much less efficient at larger-than-optimal bandwidths. In this work, we explore the extent to which the dual-tree approach can be integrated with multipole-like Hermite expansions in order to achieve reasonable efficiency across all bandwidth scales, though only for low dimensionalities. In the process, we derive and demonstrate the first truly hierarchical fast Gauss transforms, effectively combining the best tools from discrete algorithms and continuous approximation theory. 1
6 0.38491422 70 nips-2005-Fast Information Value for Graphical Models
7 0.3798289 69 nips-2005-Fast Gaussian Process Regression using KD-Trees
8 0.37330952 4 nips-2005-A Bayesian Spatial Scan Statistic
9 0.33475488 41 nips-2005-Coarse sample complexity bounds for active learning
10 0.33136278 18 nips-2005-Active Learning For Identifying Function Threshold Boundaries
11 0.3201983 140 nips-2005-Nonparametric inference of prior probabilities from Bayes-optimal behavior
12 0.3123982 102 nips-2005-Kernelized Infomax Clustering
13 0.30554017 120 nips-2005-Learning vehicular dynamics, with application to modeling helicopters
14 0.30481386 52 nips-2005-Correlated Topic Models
15 0.30109304 201 nips-2005-Variational Bayesian Stochastic Complexity of Mixture Models
16 0.29054734 174 nips-2005-Separation of Music Signals by Harmonic Structure Modeling
17 0.28897223 48 nips-2005-Context as Filtering
18 0.28017026 159 nips-2005-Q-Clustering
19 0.27426696 117 nips-2005-Learning from Data of Variable Quality
20 0.27299523 133 nips-2005-Nested sampling for Potts models
topicId topicWeight
[(3, 0.051), (10, 0.021), (27, 0.034), (31, 0.039), (34, 0.055), (39, 0.019), (40, 0.384), (41, 0.019), (55, 0.025), (65, 0.012), (69, 0.068), (73, 0.058), (80, 0.011), (88, 0.083), (91, 0.037)]
simIndex simValue paperId paperTitle
same-paper 1 0.77790934 33 nips-2005-Bayesian Sets
Author: Zoubin Ghahramani, Katherine A. Heller
Abstract: Inspired by “Google™ Sets”, we consider the problem of retrieving items from a concept or cluster, given a query consisting of a few items from that cluster. We formulate this as a Bayesian inference problem and describe a very simple algorithm for solving it. Our algorithm uses a model-based concept of a cluster and ranks items using a score which evaluates the marginal probability that each item belongs to a cluster containing the query items. For exponential family models with conjugate priors this marginal probability is a simple function of sufficient statistics. We focus on sparse binary data and show that our score can be evaluated exactly using a single sparse matrix multiplication, making it possible to apply our algorithm to very large datasets. We evaluate our algorithm on three datasets: retrieving movies from EachMovie, finding completions of author sets from the NIPS dataset, and finding completions of sets of words appearing in the Grolier encyclopedia. We compare to Google™ Sets and show that Bayesian Sets gives very reasonable set completions. 1
2 0.6172452 66 nips-2005-Estimation of Intrinsic Dimensionality Using High-Rate Vector Quantization
Author: Maxim Raginsky, Svetlana Lazebnik
Abstract: We introduce a technique for dimensionality estimation based on the notion of quantization dimension, which connects the asymptotic optimal quantization error for a probability distribution on a manifold to its intrinsic dimension. The definition of quantization dimension yields a family of estimation algorithms, whose limiting case is equivalent to a recent method based on packing numbers. Using the formalism of high-rate vector quantization, we address issues of statistical consistency and analyze the behavior of our scheme in the presence of noise.
3 0.36522666 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity
Author: Amir Navot, Lavi Shpigelman, Naftali Tishby, Eilon Vaadia
Abstract: We present a non-linear, simple, yet effective, feature subset selection method for regression and use it in analyzing cortical neural activity. Our algorithm involves a feature-weighted version of the k-nearest-neighbor algorithm. It is able to capture complex dependency of the target function on its input and makes use of the leave-one-out error as a natural regularization. We explain the characteristics of our algorithm on synthetic problems and use it in the context of predicting hand velocity from spikes recorded in motor cortex of a behaving monkey. By applying feature selection we are able to improve prediction quality and suggest a novel way of exploring neural data.
4 0.36339933 102 nips-2005-Kernelized Infomax Clustering
Author: David Barber, Felix V. Agakov
Abstract: We propose a simple information-theoretic approach to soft clustering based on maximizing the mutual information I(x, y) between the unknown cluster labels y and the training patterns x with respect to parameters of specifically constrained encoding distributions. The constraints are chosen such that patterns are likely to be clustered similarly if they lie close to specific unknown vectors in the feature space. The method may be conveniently applied to learning the optimal affinity matrix, which corresponds to learning parameters of the kernelized encoder. The procedure does not require computations of eigenvalues of the Gram matrices, which makes it potentially attractive for clustering large data sets. 1
5 0.35694462 200 nips-2005-Variable KD-Tree Algorithms for Spatial Pattern Search and Discovery
Author: Jeremy Kubica, Joseph Masiero, Robert Jedicke, Andrew Connolly, Andrew W. Moore
Abstract: In this paper we consider the problem of finding sets of points that conform to a given underlying model from within a dense, noisy set of observations. This problem is motivated by the task of efficiently linking faint asteroid detections, but is applicable to a range of spatial queries. We survey current tree-based approaches, showing a trade-off exists between single tree and multiple tree algorithms. To this end, we present a new type of multiple tree algorithm that uses a variable number of trees to exploit the advantages of both approaches. We empirically show that this algorithm performs well using both simulated and astronomical data.
6 0.35581082 198 nips-2005-Using ``epitomes'' to model genetic diversity: Rational design of HIV vaccine cocktails
7 0.35510308 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction
8 0.35501391 30 nips-2005-Assessing Approximations for Gaussian Process Classification
9 0.35440066 45 nips-2005-Conditional Visual Tracking in Kernel Space
10 0.35226676 74 nips-2005-Faster Rates in Regression via Active Learning
11 0.35146576 63 nips-2005-Efficient Unsupervised Learning for Localization and Detection in Object Categories
12 0.34927797 144 nips-2005-Off-policy Learning with Options and Recognizers
13 0.34889784 35 nips-2005-Bayesian model learning in human visual perception
14 0.34832498 100 nips-2005-Interpolating between types and tokens by estimating power-law generators
15 0.34758282 179 nips-2005-Sparse Gaussian Processes using Pseudo-inputs
16 0.34738886 151 nips-2005-Pattern Recognition from One Example by Chopping
17 0.34722191 48 nips-2005-Context as Filtering
18 0.34714317 51 nips-2005-Correcting sample selection bias in maximum entropy density estimation
19 0.34637165 133 nips-2005-Nested sampling for Potts models
20 0.34634247 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification