nips nips2001 nips2001-53 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Wheeler Ruml
Abstract: If the promise of computational modeling is to be fully realized in higherlevel cognitive domains such as language processing, principled methods must be developed to construct the semantic representations used in such models. In this paper, we propose the use of an established formalism from mathematical psychology, additive clustering, as a means of automatically constructing binary representations for objects using only pairwise similarity data. However, existing methods for the unsupervised learning of additive clustering models do not scale well to large problems. We present a new algorithm for additive clustering, based on a novel heuristic technique for combinatorial optimization. The algorithm is simpler than previous formulations and makes fewer independence assumptions. Extensive empirical tests on both human and synthetic data suggest that it is more effective than previous methods and that it also scales better to larger problems. By making additive clustering practical, we take a significant step toward scaling connectionist models beyond hand-coded examples.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract If the promise of computational modeling is to be fully realized in higherlevel cognitive domains such as language processing, principled methods must be developed to construct the semantic representations used in such models. [sent-3, score-0.104]
2 In this paper, we propose the use of an established formalism from mathematical psychology, additive clustering, as a means of automatically constructing binary representations for objects using only pairwise similarity data. [sent-4, score-0.396]
3 However, existing methods for the unsupervised learning of additive clustering models do not scale well to large problems. [sent-5, score-0.233]
4 We present a new algorithm for additive clustering, based on a novel heuristic technique for combinatorial optimization. [sent-6, score-0.386]
5 The algorithm is simpler than previous formulations and makes fewer independence assumptions. [sent-7, score-0.074]
6 Extensive empirical tests on both human and synthetic data suggest that it is more effective than previous methods and that it also scales better to larger problems. [sent-8, score-0.163]
7 By making additive clustering practical, we take a significant step toward scaling connectionist models beyond hand-coded examples. [sent-9, score-0.233]
8 1 Introduction Many cognitive models posit mental representations based on discrete substructures. [sent-10, score-0.104]
9 Even connectionist models whose processing involves manipulation of real-valued activations typically represent objects as patterns of 0s and 1s across a set of units (Noelle, Cottrell, and Wilms, 1997). [sent-11, score-0.087]
10 Often, individual units are taken to represent specific features of the objects and two representations will share features to the degree to which the two objects are similar. [sent-12, score-0.328]
11 Using random feature assignments clouds the relationship between the model and the objects it is intended to represent, diminishing the model’s value. [sent-14, score-0.087]
12 These difficulties effectively limit the number of objects that can be encoded, constraining modeling efforts to small examples. [sent-16, score-0.087]
13 In this paper, we investigate methods for automatically synthesizing feature-based representations directly from the pairwise object similarities that the model is intended to respect. [sent-17, score-0.099]
14 This approach eliminates the manual burden of selecting and assigning features while providing an explicit design criterion that objectively connects the representations to empirical data. [sent-31, score-0.211]
15 We will then investigate a new approach, based on combinatorial optimization. [sent-33, score-0.155]
16 When using a novel heuristic search technique, we find that the new approach, despite its simplicity, performs better than previous algorithms and that, perhaps more important, it maintains its effectiveness on large problems. [sent-34, score-0.152]
17 1 Additive Clustering We will formalize the problem of constructing discrete features from similarity information using the additive clustering model of Shepard and Arabie (1979). [sent-36, score-0.404]
18 Each of the k features has a non-negative real-valued weight wk , and the similarity between two objects i and j is just the sum of the weights of the features they share. [sent-38, score-0.359]
19 If f_ik is 1 if object i has feature k and 0 otherwise, and c is a real-valued constant, then the similarity of i and j is modeled as ŝ_ij = Σ_k w_k f_ik f_jk + c. [sent-39, score-0.3]
20 The representation of each object is simply the binary column specifying its membership in, or absence from, each cluster. [sent-42, score-0.074]
21 Additive clustering is asymmetric in the sense that only the shared features of two objects contribute to their similarity, not the ones they both lack. [sent-43, score-0.274]
22 ) With a model formalism in hand, we can then phrase the problem of constructing feature assignments as simply finding the ADCLUS model that best matches the given similarity data using the desired number of features. [sent-45, score-0.153]
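To make the formalism concrete, here is a minimal Python/NumPy sketch (not from the paper; the membership matrix F, weights w, and constant c are hypothetical inputs, and the off-diagonal VAF definition is the usual convention in this literature rather than a quotation of the paper's formula):

    import numpy as np

    def adclus_similarity(F, w, c):
        # Modeled similarity: s_hat[i, j] = sum_k w[k] * F[i, k] * F[j, k] + c,
        # with F an (n, k) binary membership matrix and w non-negative weights.
        return (F * w) @ F.T + c

    def vaf(S, S_hat):
        # Variance accounted for, computed over off-diagonal entries only.
        mask = ~np.eye(S.shape[0], dtype=bool)
        resid = S[mask] - S_hat[mask]
        total = S[mask] - S[mask].mean()
        return 1.0 - (resid ** 2).sum() / (total ** 2).sum()

    # Toy usage with a hypothetical 4-object, 2-cluster model.
    F = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
    w = np.array([2.0, 3.0])
    S_hat = adclus_similarity(F, w, c=1.0)
    print(vaf(S_hat, S_hat))  # a perfect fit gives VAF = 1.0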
23 2 Previous Algorithms Additive clustering is a difficult 0-1 quadratic programming problem and only heuristic methods, which do not guarantee an optimal model, have been proposed. [sent-47, score-0.177]
24 Non-discrete Approximation: Arabie and Carroll (1980) and Carroll and Arabie (1983) proposed the two-stage indclus algorithm. [sent-51, score-0.432]
25 In the first stage, cluster memberships are treated as real values and optimized for each cluster in turn by gradient descent. [sent-52, score-0.484]
26 Afterwards, a combinatorial clean-up stage tries all possible changes to 1 or 2 cluster memberships. [sent-54, score-0.266]
27 Asymmetric Approximation: In the sindclus algorithm, Chaturvedi and Carroll (1994) optimize an asymmetric model with two sets of cluster memberships, having the form ŝ_ij = Σ_k w_k f_ik g_jk + c. [sent-57, score-0.496]
28 By considering each cluster in turn, this formulation allows a fast method for determining each of F, G, and w given the other two. [sent-58, score-0.111]
29 Experiments reported below use both a version of the original implementation that has been modified to handle large instances and a reimplemented version (resindclus) that differs in its behavior at boundary cases (handling 0 weights, empty clusters, ties). [sent-60, score-0.078]
30 The weights and constants of each model were optimized using constrained least-squares linear regression (Stark and Parker, 1995), ensuring non-negative cluster weights, and the one with the highest VAF (variance accounted for) was used. [sent-62, score-0.151]
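A rough sketch of this weight-fitting step (an assumption-laden stand-in, not the Stark and Parker routine: it uses SciPy's non-negative least squares, which additionally constrains the constant c to be non-negative):

    import numpy as np
    from scipy.optimize import nnls

    def fit_weights(S, F):
        # Given observed similarities S (n x n) and fixed binary memberships F (n x k),
        # fit non-negative cluster weights w and an additive constant c by least
        # squares over the unordered object pairs.
        n, k = F.shape
        iu = np.triu_indices(n, 1)
        cols = [np.outer(F[:, m], F[:, m])[iu] for m in range(k)]
        X = np.column_stack(cols + [np.ones(len(iu[0]))])
        coef, _ = nnls(X, S[iu])
        return coef[:k], coef[k]   # weights, constant

Because the memberships are fixed here, refitting all the weights after a membership change (as the combinatorial approach described below does) amounts to a sequence of such small regressions.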
31 Alternating Clusters: Kiers (1997) proposed an element-wise simplified sindclus algorithm, which we abbreviate as ewindclus. [sent-63, score-0.144]
32 Like sindclus, it considers each cluster in turn, alternating between the weights and the cluster memberships, although only one set of clusters is maintained. [sent-64, score-0.333]
33 Weights are set by a simple regression and memberships are determined by a gradient function that assumes object independence and fixed weights. [sent-65, score-0.269]
34 (A comparison with the original code showed this modification to give equivalent results using less running time. [sent-70, score-0.079]
35 Most published comparisons of additive clustering algorithms use only a small number of test problems (or only artificial data) and report only the best solution found within an unspecified amount of time. [sent-72, score-0.345]
36 Because the algorithms use random starting configurations and often return solutions of widely varying quality even when run repeatedly on the same problem, this leaves it unclear which algorithm gives the best results on a typical run. [sent-73, score-0.201]
37 Furthermore, different … (Table 2: The performance of several previously proposed algorithms on data sets from psychological experiments.) [sent-74, score-0.13]
38 The first set is a collection of 6 typical data sets from psychological experiments that have been used in previous additive clustering work (originally by Shepard and Arabie (1979), except for animals-s, Mechelen and Storms (1995), and animals, Chaturvedi and Carroll (1994)). [sent-80, score-0.339]
39 The number of objects (n) and the number of features used (k) are listed for each instance as part of Table 3. [sent-81, score-0.163]
40 The second set of problems contains noiseless synthetic data derived from ADCLUS models with 8, 16, 32, 64, and 128 objects. [sent-82, score-0.097]
41 In a rough approximation of the human data, the number of clusters was set to 2 log2(n), and as in previous ADCLUS work, each object was inserted in each cluster with probability 0. [sent-83, score-0.282]
42 A single similarity matrix was generated from each model using weights and constants uniformly distributed between 1 and 6. [sent-85, score-0.126]
43 The third set of problems was derived from the second by adding Gaussian noise with a variance of 10% of the variance of the similarity data and enforcing symmetry. [sent-86, score-0.183]
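The synthetic problem sets can be approximated along the following lines (a sketch under stated assumptions: the membership probability p is left as a parameter because its value is truncated in the text above, and the 0.5 in the usage lines is only an illustrative choice):

    import numpy as np

    def make_synthetic(n, p, rng):
        # Noise-free ADCLUS data: 2*log2(n) clusters, membership probability p,
        # weights and constant uniform on [1, 6].
        k = int(2 * np.log2(n))
        F = (rng.random((n, k)) < p).astype(float)
        w = rng.uniform(1.0, 6.0, size=k)
        c = rng.uniform(1.0, 6.0)
        return (F * w) @ F.T + c, F, w, c

    def add_noise(S, rng, frac=0.10):
        # Gaussian noise whose variance is a fraction of the variance of S,
        # followed by symmetrization, as described for the third problem set.
        noise = rng.normal(0.0, np.sqrt(frac * S.var()), size=S.shape)
        noisy = S + noise
        return (noisy + noisy.T) / 2.0

    rng = np.random.default_rng(0)
    S_clean, F_true, w_true, c_true = make_synthetic(32, p=0.5, rng=rng)
    S_noisy = add_noise(S_clean, rng)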
44 Each algorithm was run at least 50 times on each data set. [sent-87, score-0.085]
45 (Footnote 1) Depending as it does on running time, this comparison remains imprecise due to variations in the degree of code tuning and the quality of the compilers used, and the need to normalize timings between the multiple machines used in the tests. [sent-90, score-0.075]
46 Summaries of the time-equated results produced by each algorithm on each of the human data sets are shown in Table 2. [sent-91, score-0.141]
47 ) The mean VAF for each algorithm is listed, along with the inter-quartile range (IQR) and the mean number of runs that were necessary to achieve time parity with the slowest algorithm on that data set (r). [sent-93, score-0.151]
48 Overall, despite the variety of approaches that have been brought to bear over the years, the original indclus algorithm appears to be the best. [sent-95, score-0.512]
49 (Results in which another algorithm was superior to indclus are marked with a box. [sent-96, score-0.475]
50 It is revealing to note the differences in performance between the original and reimplemented versions of sindclus. [sent-98, score-0.078]
51 Surprisingly, on the synthetic data sets (not shown), the relative performance of the algorithms was quite different, and almost the same on the noisy data as on the noise-free data. [sent-100, score-0.157]
52 (This suggests that the randomly generated data sets that are commonly used to evaluate ADCLUS algorithms do not accurately reflect the problems of interest to practitioners. [sent-101, score-0.118]
53 ) ewindclus performed best here, although it was only occasionally able to recover the original models from the noise-free data. [sent-102, score-0.189]
54 Overall, it appears that current methods of additive clustering are quite sensitive to the type of problem they are run on and that there is little assurance that they can recover the underlying structure in the data, even for small problems. [sent-103, score-0.275]
55 3 A Purely Combinatorial Approach One common theme in indclus, sindclus, and ewindclus is their computation of each cluster and its weight in turn, at each step fitting only the residual similarity not accounted for by the other clusters. [sent-105, score-0.348]
56 This forces memberships to be considered in a predetermined order and allows weights to become obsolete. [sent-106, score-0.219]
57 Instead of computing the elements and weight of each cluster in succession, we first consider all the memberships and then derive all the weights using constrained regression. [sent-108, score-0.33]
58 And where previous algorithms recompute all the memberships of one cluster simultaneously (and therefore independently), we will change memberships one by one in a dynamically determined order using simple heuristic search techniques, recomputing the weights after each step. [sent-109, score-0.661]
59 (An incremental bounded least squares regression algorithm that took advantage of the previous solution would be ideal, but the algorithms tested below did not incorporate such an improvement. [sent-110, score-0.129]
60 ) From this perspective, one need only focus on changing the binary membership variables, and ADCLUS becomes a purely combinatorial optimization problem. [sent-111, score-0.249]
61 The first, indclus-hc, is a simple hill-climbing strategy that attempts to toggle individual memberships in an arbitrary order; the first change resulting in an improved model is accepted. [sent-113, score-0.221]
62 The algorithm terminates when no membership can be changed to give an improvement. [sent-114, score-0.08]
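A minimal sketch of a hill climber in this spirit (assuming NumPy and SciPy; the random initialization, sweep order, and use of non-negative least squares for the weight refit are illustrative choices, and hill_climb, fit_weights, and vaf are hypothetical names, not the paper's exact indclus-hc implementation):

    import numpy as np
    from scipy.optimize import nnls

    def fit_weights(S, F):
        # Non-negative weights and constant for fixed memberships (see earlier sketch).
        n, k = F.shape
        iu = np.triu_indices(n, 1)
        X = np.column_stack([np.outer(F[:, m], F[:, m])[iu] for m in range(k)]
                            + [np.ones(len(iu[0]))])
        coef, _ = nnls(X, S[iu])
        return coef[:k], coef[k]

    def vaf(S, F, w, c):
        iu = np.triu_indices(S.shape[0], 1)
        s, s_hat = S[iu], ((F * w) @ F.T + c)[iu]
        return 1.0 - ((s - s_hat) ** 2).sum() / ((s - s.mean()) ** 2).sum()

    def hill_climb(S, k, rng):
        # Toggle single memberships, refit all weights after each change, and accept
        # the first toggle that improves the VAF; stop at a local optimum.
        n = S.shape[0]
        F = (rng.random((n, k)) < 0.5).astype(float)
        w, c = fit_weights(S, F)
        best = vaf(S, F, w, c)
        improved = True
        while improved:
            improved = False
            for i in range(n):
                for m in range(k):
                    F[i, m] = 1.0 - F[i, m]          # tentative toggle
                    w2, c2 = fit_weights(S, F)
                    score = vaf(S, F, w2, c2)
                    if score > best + 1e-12:
                        best, w, c, improved = score, w2, c2, True
                    else:
                        F[i, m] = 1.0 - F[i, m]      # revert
        return F, w, c, best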
63 In the second algorithm, indclus-pbil, the PBIL algorithm of Baluja (1997) is used. (Footnote 2) Table 3 shows one anomaly: no em-indclus run on animals resulted in a VAF ≥ 0. [sent-116, score-0.143]
64 This also occurred on all synthetic problems with 32 or more objects (although very good solutions were found on the smaller problems). [sent-117, score-0.184]
65 Table 4: The performance of the combinatorial algorithms on human data sets. [sent-119, score-0.274]
66 1 KL Break-Out: A New Optimization Heuristic While the two approaches described above do not use any problem-specific information beyond solution quality, the third algorithm uses the gradient function from the ewindclus algorithm to guide the search. [sent-124, score-0.263]
67 The move strategy is a novel combination of gradient ascent and the classic method of Kernighan and Lin (1970) which we call ‘KL break-out’. [sent-125, score-0.195]
68 It proceeds by gradient ascent, changing the entry in F whose ewindclus gradient points most strongly to the opposite of its current value. [sent-126, score-0.258]
69 When the ascent no longer results in an improvement, a local maximum has been reached. [sent-127, score-0.1]
70 Motivated by results suggesting that good maxima tend to cluster (Boese, Kahng, and Muddu, 1994; Ruml et al. [sent-128, score-0.111]
71 It selects the least damaging variable to change, using the gradient as in the ascent, but now it locks each variable after changing it. [sent-130, score-0.081]
72 To determine if it has escaped, a new gradient ascent is attempted after each locking step. [sent-132, score-0.153]
73 If the ascent surpasses the previous maximum, the current break-out attempt is abandoned and the ascent is pursued. [sent-133, score-0.272]
74 If the break-out procedure changes all variables without any ascent finding a better maximum, the algorithm terminates. [sent-134, score-0.143]
75 The procedure is guaranteed to return a solution at least as good as that found by the original KL method (although it will take longer), and it has more flexibility to follow the gradient function. [sent-135, score-0.09]
76 This algorithm, which we will call ewindclus-klb, surpassed the original KL method in time-equated tests. [sent-136, score-0.078]
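The full ewindclus-klb procedure involves details not spelled out above, so the following is only a compressed sketch of its general shape under several assumptions: the "gradient" is read here as the derivative of the squared-error fit with respect to each membership at fixed weights, a probe ascent that fails to escape is discarded rather than kept, and the names kl_breakout, ascend, and flip_scores (and the repeated helpers) are hypothetical.

    import numpy as np
    from scipy.optimize import nnls

    def fit_weights(S, F):
        n, k = F.shape
        iu = np.triu_indices(n, 1)
        X = np.column_stack([np.outer(F[:, m], F[:, m])[iu] for m in range(k)]
                            + [np.ones(len(iu[0]))])
        coef, _ = nnls(X, S[iu])
        return coef[:k], coef[k]

    def vaf(S, F, w, c):
        iu = np.triu_indices(S.shape[0], 1)
        s, s_hat = S[iu], ((F * w) @ F.T + c)[iu]
        return 1.0 - ((s - s_hat) ** 2).sum() / ((s - s.mean()) ** 2).sum()

    def flip_scores(S, F, w, c):
        # Derivative of the squared-error fit w.r.t. each membership at fixed weights,
        # signed so that a large positive score means "flipping this bit looks good".
        resid = S - ((F * w) @ F.T + c)
        np.fill_diagonal(resid, 0.0)
        grad = 2.0 * resid @ (F * w)
        return grad * (1.0 - 2.0 * F)

    def ascend(S, F, w, c, best):
        # Greedy ascent: keep flipping the highest-scoring bit while the refitted
        # model improves the VAF; stop (and revert the last flip) at a local maximum.
        while True:
            i, m = np.unravel_index(np.argmax(flip_scores(S, F, w, c)), F.shape)
            F[i, m] = 1.0 - F[i, m]
            w2, c2 = fit_weights(S, F)
            score = vaf(S, F, w2, c2)
            if score > best:
                best, w, c = score, w2, c2
            else:
                F[i, m] = 1.0 - F[i, m]
                return F, w, c, best

    def kl_breakout(S, k, rng):
        n = S.shape[0]
        F = (rng.random((n, k)) < 0.5).astype(float)
        w, c = fit_weights(S, F)
        F, w, c, best = ascend(S, F, w, c, vaf(S, F, w, c))
        while True:
            saved = (F.copy(), w, c)                 # best solution found so far
            locked = np.zeros(F.shape, dtype=bool)
            escaped = False
            while not locked.all():
                scores = np.where(locked, -np.inf, flip_scores(S, F, w, c))
                i, m = np.unravel_index(np.argmax(scores), F.shape)
                F[i, m] = 1.0 - F[i, m]              # least-damaging unlocked flip
                locked[i, m] = True
                w, c = fit_weights(S, F)
                probe, pw, pc = F.copy(), w, c       # try to escape from here
                probe, pw, pc, score = ascend(S, probe, pw, pc, vaf(S, probe, pw, pc))
                if score > best:                     # surpassed the previous maximum
                    F, w, c, best, escaped = probe, pw, pc, score, True
                    break
            if not escaped:
                F, w, c = saved
                return F, w, c, best

As with the other algorithms compared above, one would run kl_breakout from several random starts and keep the highest-VAF model when reproducing a time-equated comparison.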
77 2 Evaluation of the Combinatorial Algorithms The time-equated performance of the combinatorial algorithms on the human data sets is shown in Table 4, with indclus, the best of the previous algorithms, shown for comparison. [sent-139, score-0.367]
78 As one might expect, adding heuristic guidance to the search helps it enormously: ewindclus-klb surpasses the other combinatorial algorithms on every problem. [sent-140, score-0.347]
79 It performs better than indclus on three of the human data sets (top panel), equals its performance on two, and performs worse on one data set, letters. [sent-141, score-0.53]
80 ) The variance of indclus on letters is very small, and the full distributions suggest that ewindclus-klb is the better choice on this data set if one can afford the time to take the best of 20 runs. [sent-143, score-0.563]
81 Table 5: ewindclus-klb and indclus on noisy synthetic data sets of increasing size. [sent-144, score-0.534]
          ewindclus-klb        indclus
     n    VAF    IQR           VAF    IQR       r
     8    97     96–97         95     93–97     1
    16    91     90–92         86     85–87     4
    32    90     88–92         83     82–84     22
    64    91     90–91         84     84–85     100
   128    91     91–91         88     87–90     381
82 (Experiments using 7 additional human data sets found that letters represented the weakest performance of ewindclus-klb.) [sent-145, score-0.599]
83 Performance of a plain KL strategy (not shown) surpassed or equaled indclus on all but two problems (consonants and letters), indicating that the combinatorial approach, in tandem with heuristic guidance, is powerful even without the new ‘KL break-out’ strategy. [sent-146, score-0.765]
84 While we have already seen that synthetic data does not predict the relative performance of algorithms on human data very well, it does provide a test of how well they scale to larger problems. [sent-147, score-0.187]
85 On noise-free synthetic data, ewindclus-klb reliably recovered the original model on all data sets. [sent-148, score-0.105]
86 It was also the best performer on the noisy synthetic data (a comparison with indclus is presented in Table 5. [sent-149, score-0.528]
87 These results show that, in addition to performing best on the human data, the combinatorial approach retains its effectiveness on larger problems. [sent-150, score-0.247]
88 … 6% worse, does not contain s̆ in the unvoiced cluster. [sent-153, score-0.103]
89 4 Conclusions We formalized the problem of constructing feature-based representations for cognitive modeling as the unsupervised learning of ADCLUS models from similarity data. [sent-154, score-0.229]
90 In an empirical comparison sensitive to variance in solution quality and computation time, we found that several recently proposed methods for recovering such models perform worse than the original indclus algorithm of Arabie and Carroll (1980). [sent-155, score-0.579]
91 We suggested a purely combinatorial approach to this problem that is simpler than previous proposals, yet more effective. [sent-156, score-0.215]
92 By changing memberships one at a time, it makes fewer independence assumptions. [sent-157, score-0.207]
93 We also proposed a novel variant of the Kernighan-Lin optimization strategy that is able to follow the gradient function more closely, surpassing the performance of the original. [sent-158, score-0.095]
94 While this work has extended the reach of the additive clustering paradigm to large problems, it is directly applicable to feature construction of only those cognitive models whose representations encode similarity as shared features. [sent-159, score-0.423]
95 (The cluster weights can be represented by duplicating strong features or by varying connection weights. [sent-160, score-0.197]
96 ) However, the simplicity of the combinatorial approach should make it straightforward to extend to models in which the absence of features can enhance similarity. [sent-161, score-0.201]
97 A new adaptive multi-start technique for combinatorial global optimizations. [sent-181, score-0.155]
98 An alternating combinatorial optimization approach to fitting the INDCLUS and generalized INDCLUS models. [sent-191, score-0.187]
99 A maximum likelihood method for additive clustering and its applications. [sent-202, score-0.233]
100 A simple method for generating additive clustering models with limited complexity. [sent-217, score-0.233]
wordName wordTfidf (topN-words)
[('vaf', 0.453), ('indclus', 0.432), ('iqr', 0.233), ('dclus', 0.227), ('arabie', 0.185), ('memberships', 0.179), ('combinatorial', 0.155), ('sindclus', 0.144), ('carroll', 0.125), ('ewindclus', 0.124), ('additive', 0.122), ('clustering', 0.111), ('cluster', 0.111), ('phipps', 0.103), ('unvoiced', 0.103), ('ascent', 0.1), ('objects', 0.087), ('similarity', 0.086), ('chaturvedi', 0.082), ('cottrell', 0.082), ('ruml', 0.082), ('shepard', 0.082), ('sij', 0.082), ('letters', 0.069), ('synthetic', 0.068), ('heuristic', 0.066), ('human', 0.064), ('tenenbaum', 0.062), ('representations', 0.062), ('adclus', 0.062), ('clouse', 0.062), ('kiers', 0.062), ('workers', 0.062), ('douglas', 0.061), ('consonants', 0.061), ('animals', 0.058), ('kl', 0.057), ('table', 0.057), ('algorithms', 0.055), ('wk', 0.054), ('josh', 0.054), ('voiced', 0.054), ('gradient', 0.053), ('tting', 0.05), ('features', 0.046), ('lee', 0.046), ('stops', 0.046), ('algorithm', 0.043), ('run', 0.042), ('code', 0.042), ('cognitive', 0.042), ('strategy', 0.042), ('psychological', 0.041), ('anil', 0.041), ('boese', 0.041), ('fik', 0.041), ('garrison', 0.041), ('henk', 0.041), ('hojo', 0.041), ('kahng', 0.041), ('kernighan', 0.041), ('mapclus', 0.041), ('mechelen', 0.041), ('noelle', 0.041), ('reimplemented', 0.041), ('stark', 0.041), ('surpassed', 0.041), ('surpasses', 0.041), ('weights', 0.04), ('modi', 0.039), ('michael', 0.039), ('constructing', 0.039), ('clusters', 0.039), ('name', 0.037), ('membership', 0.037), ('object', 0.037), ('original', 0.037), ('psychometrika', 0.036), ('slowest', 0.036), ('wheeler', 0.036), ('sets', 0.034), ('variance', 0.034), ('quality', 0.033), ('baluja', 0.033), ('nished', 0.033), ('alternating', 0.032), ('previous', 0.031), ('attraction', 0.03), ('listed', 0.03), ('harvard', 0.03), ('guidance', 0.03), ('turn', 0.03), ('asymmetric', 0.03), ('purely', 0.029), ('problems', 0.029), ('runs', 0.029), ('changing', 0.028), ('best', 0.028), ('accounted', 0.027), ('gurations', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 53 nips-2001-Constructing Distributed Representations Using Additive Clustering
Author: Wheeler Ruml
Abstract: If the promise of computational modeling is to be fully realized in higherlevel cognitive domains such as language processing, principled methods must be developed to construct the semantic representations used in such models. In this paper, we propose the use of an established formalism from mathematical psychology, additive clustering, as a means of automatically constructing binary representations for objects using only pairwise similarity data. However, existing methods for the unsupervised learning of additive clustering models do not scale well to large problems. We present a new algorithm for additive clustering, based on a novel heuristic technique for combinatorial optimization. The algorithm is simpler than previous formulations and makes fewer independence assumptions. Extensive empirical tests on both human and synthetic data suggest that it is more effective than previous methods and that it also scales better to larger problems. By making additive clustering practical, we take a significant step toward scaling connectionist models beyond hand-coded examples.
2 0.090887517 135 nips-2001-On Spectral Clustering: Analysis and an algorithm
Author: Andrew Y. Ng, Michael I. Jordan, Yair Weiss
Abstract: Despite many empirical successes of spectral clustering methodsalgorithms that cluster points using eigenvectors of matrices derived from the data- there are several unresolved issues. First, there are a wide variety of algorithms that use the eigenvectors in slightly different ways. Second, many of these algorithms have no proof that they will actually compute a reasonable clustering. In this paper, we present a simple spectral clustering algorithm that can be implemented using a few lines of Matlab. Using tools from matrix perturbation theory, we analyze the algorithm, and give conditions under which it can be expected to do well. We also show surprisingly good experimental results on a number of challenging clustering problems. 1
3 0.090810284 171 nips-2001-Spectral Relaxation for K-means Clustering
Author: Hongyuan Zha, Xiaofeng He, Chris Ding, Ming Gu, Horst D. Simon
Abstract: The popular K-means clustering partitions a data set by minimizing a sum-of-squares cost function. A coordinate descend method is then used to find local minima. In this paper we show that the minimization can be reformulated as a trace maximization problem associated with the Gram matrix of the data vectors. Furthermore, we show that a relaxed version of the trace maximization problem possesses global optimal solutions which can be obtained by computing a partial eigendecomposition of the Gram matrix, and the cluster assignment for each data vectors can be found by computing a pivoted QR decomposition of the eigenvector matrix. As a by-product we also derive a lower bound for the minimum of the sum-of-squares cost function. 1
4 0.081762105 80 nips-2001-Generalizable Relational Binding from Coarse-coded Distributed Representations
Author: Randall C. O'Reilly, R. S. Busby
Abstract: We present a model of binding of relationship information in a spatial domain (e.g., square above triangle) that uses low-order coarse-coded conjunctive representations instead of more popular temporal synchrony mechanisms. Supporters of temporal synchrony argue that conjunctive representations lack both efficiency (i.e., combinatorial numbers of units are required) and systematicity (i.e., the resulting representations are overly specific and thus do not support generalization to novel exemplars). To counter these claims, we show that our model: a) uses far fewer hidden units than the number of conjunctions represented, by using coarse-coded, distributed representations where each unit has a broad tuning curve through high-dimensional conjunction space, and b) is capable of considerable generalization to novel inputs.
5 0.072113454 185 nips-2001-The Method of Quantum Clustering
Author: David Horn, Assaf Gottlieb
Abstract: We propose a novel clustering method that is an extension of ideas inherent to scale-space clustering and support-vector clustering. Like the latter, it associates every data point with a vector in Hilbert space, and like the former it puts emphasis on their total sum, that is equal to the scalespace probability function. The novelty of our approach is the study of an operator in Hilbert space, represented by the Schr¨ dinger equation of o which the probability function is a solution. This Schr¨ dinger equation o contains a potential function that can be derived analytically from the probability function. We associate minima of the potential with cluster centers. The method has one variable parameter, the scale of its Gaussian kernel. We demonstrate its applicability on known data sets. By limiting the evaluation of the Schr¨ dinger potential to the locations of data points, o we can apply this method to problems in high dimensions.
6 0.06158739 132 nips-2001-Novel iteration schemes for the Cluster Variation Method
7 0.058670133 8 nips-2001-A General Greedy Approximation Algorithm with Applications
8 0.057080511 24 nips-2001-Active Information Retrieval
9 0.053861119 100 nips-2001-Iterative Double Clustering for Unsupervised and Semi-Supervised Learning
10 0.053247314 191 nips-2001-Transform-invariant Image Decomposition with Similarity Templates
11 0.051291399 122 nips-2001-Model Based Population Tracking and Automatic Detection of Distribution Changes
12 0.051063649 129 nips-2001-Multiplicative Updates for Classification by Mixture Models
13 0.049608737 43 nips-2001-Bayesian time series classification
14 0.046173751 186 nips-2001-The Noisy Euclidean Traveling Salesman Problem and Learning
15 0.045953229 65 nips-2001-Effective Size of Receptive Fields of Inferior Temporal Visual Cortex Neurons in Natural Scenes
16 0.045491051 46 nips-2001-Categorization by Learning and Combining Object Parts
17 0.045406934 54 nips-2001-Contextual Modulation of Target Saliency
18 0.044310097 170 nips-2001-Spectral Kernel Methods for Clustering
19 0.043926626 33 nips-2001-An Efficient Clustering Algorithm Using Stochastic Association Model and Its Implementation Using Nanostructures
20 0.043719139 149 nips-2001-Probabilistic Abstraction Hierarchies
topicId topicWeight
[(0, -0.157), (1, 0.001), (2, -0.001), (3, -0.041), (4, -0.012), (5, -0.057), (6, -0.111), (7, -0.011), (8, -0.023), (9, -0.057), (10, 0.077), (11, -0.028), (12, -0.038), (13, 0.049), (14, -0.041), (15, 0.047), (16, 0.074), (17, -0.041), (18, -0.03), (19, -0.0), (20, 0.077), (21, -0.043), (22, 0.017), (23, 0.011), (24, 0.116), (25, 0.04), (26, 0.036), (27, 0.105), (28, 0.003), (29, -0.07), (30, -0.021), (31, -0.047), (32, 0.017), (33, -0.016), (34, 0.01), (35, -0.054), (36, 0.084), (37, -0.057), (38, -0.012), (39, 0.036), (40, -0.085), (41, 0.082), (42, -0.026), (43, -0.065), (44, -0.011), (45, 0.089), (46, 0.172), (47, 0.053), (48, 0.014), (49, 0.111)]
simIndex simValue paperId paperTitle
same-paper 1 0.91528374 53 nips-2001-Constructing Distributed Representations Using Additive Clustering
Author: Wheeler Ruml
Abstract: If the promise of computational modeling is to be fully realized in higherlevel cognitive domains such as language processing, principled methods must be developed to construct the semantic representations used in such models. In this paper, we propose the use of an established formalism from mathematical psychology, additive clustering, as a means of automatically constructing binary representations for objects using only pairwise similarity data. However, existing methods for the unsupervised learning of additive clustering models do not scale well to large problems. We present a new algorithm for additive clustering, based on a novel heuristic technique for combinatorial optimization. The algorithm is simpler than previous formulations and makes fewer independence assumptions. Extensive empirical tests on both human and synthetic data suggest that it is more effective than previous methods and that it also scales better to larger problems. By making additive clustering practical, we take a significant step toward scaling connectionist models beyond hand-coded examples.
2 0.65492588 185 nips-2001-The Method of Quantum Clustering
Author: David Horn, Assaf Gottlieb
Abstract: We propose a novel clustering method that is an extension of ideas inherent to scale-space clustering and support-vector clustering. Like the latter, it associates every data point with a vector in Hilbert space, and like the former it puts emphasis on their total sum, that is equal to the scalespace probability function. The novelty of our approach is the study of an operator in Hilbert space, represented by the Schr¨ dinger equation of o which the probability function is a solution. This Schr¨ dinger equation o contains a potential function that can be derived analytically from the probability function. We associate minima of the potential with cluster centers. The method has one variable parameter, the scale of its Gaussian kernel. We demonstrate its applicability on known data sets. By limiting the evaluation of the Schr¨ dinger potential to the locations of data points, o we can apply this method to problems in high dimensions.
3 0.60663366 100 nips-2001-Iterative Double Clustering for Unsupervised and Semi-Supervised Learning
Author: Ran El-Yaniv, Oren Souroujon
Abstract: We present a powerful meta-clustering technique called Iterative Double Clustering (IDC). The IDC method is a natural extension of the recent Double Clustering (DC) method of Slonim and Tishby that exhibited impressive performance on text categorization tasks [12]. Using synthetically generated data we empirically find that whenever the DC procedure is successful in recovering some of the structure hidden in the data, the extended IDC procedure can incrementally compute a significantly more accurate classification. IDC is especially advantageous when the data exhibits high attribute noise. Our simulation results also show the effectiveness of IDC in text categorization problems. Surprisingly, this unsupervised procedure can be competitive with a (supervised) SVM trained with a small training set. Finally, we propose a simple and natural extension of IDC for semi-supervised and transductive learning where we are given both labeled and unlabeled examples. 1
4 0.54151827 135 nips-2001-On Spectral Clustering: Analysis and an algorithm
Author: Andrew Y. Ng, Michael I. Jordan, Yair Weiss
Abstract: Despite many empirical successes of spectral clustering methodsalgorithms that cluster points using eigenvectors of matrices derived from the data- there are several unresolved issues. First, there are a wide variety of algorithms that use the eigenvectors in slightly different ways. Second, many of these algorithms have no proof that they will actually compute a reasonable clustering. In this paper, we present a simple spectral clustering algorithm that can be implemented using a few lines of Matlab. Using tools from matrix perturbation theory, we analyze the algorithm, and give conditions under which it can be expected to do well. We also show surprisingly good experimental results on a number of challenging clustering problems. 1
5 0.48019609 171 nips-2001-Spectral Relaxation for K-means Clustering
Author: Hongyuan Zha, Xiaofeng He, Chris Ding, Ming Gu, Horst D. Simon
Abstract: The popular K-means clustering partitions a data set by minimizing a sum-of-squares cost function. A coordinate descend method is then used to find local minima. In this paper we show that the minimization can be reformulated as a trace maximization problem associated with the Gram matrix of the data vectors. Furthermore, we show that a relaxed version of the trace maximization problem possesses global optimal solutions which can be obtained by computing a partial eigendecomposition of the Gram matrix, and the cluster assignment for each data vectors can be found by computing a pivoted QR decomposition of the eigenvector matrix. As a by-product we also derive a lower bound for the minimum of the sum-of-squares cost function. 1
6 0.45308042 80 nips-2001-Generalizable Relational Binding from Coarse-coded Distributed Representations
7 0.45130816 149 nips-2001-Probabilistic Abstraction Hierarchies
8 0.44916871 30 nips-2001-Agglomerative Multivariate Information Bottleneck
10 0.420836 110 nips-2001-Learning Hierarchical Structures with Linear Relational Embedding
11 0.38687155 24 nips-2001-Active Information Retrieval
12 0.38304725 132 nips-2001-Novel iteration schemes for the Cluster Variation Method
13 0.35594305 122 nips-2001-Model Based Population Tracking and Automatic Detection of Distribution Changes
14 0.34040788 114 nips-2001-Learning from Infinite Data in Finite Time
15 0.3373988 129 nips-2001-Multiplicative Updates for Classification by Mixture Models
16 0.33549827 70 nips-2001-Estimating Car Insurance Premia: a Case Study in High-Dimensional Data Inference
17 0.33311608 151 nips-2001-Probabilistic principles in unsupervised learning of visual structure: human data and a model
18 0.33235678 57 nips-2001-Correlation Codes in Neuronal Populations
19 0.33123118 64 nips-2001-EM-DD: An Improved Multiple-Instance Learning Technique
20 0.31300914 109 nips-2001-Learning Discriminative Feature Transforms to Low Dimensions in Low Dimentions
topicId topicWeight
[(14, 0.018), (17, 0.017), (19, 0.022), (27, 0.09), (30, 0.067), (38, 0.028), (59, 0.018), (72, 0.535), (79, 0.03), (91, 0.097)]
simIndex simValue paperId paperTitle
1 0.98481917 21 nips-2001-A Variational Approach to Learning Curves
Author: Dörthe Malzahn, Manfred Opper
Abstract: We combine the replica approach from statistical physics with a variational approach to analyze learning curves analytically. We apply the method to Gaussian process regression. As a main result we derive approximative relations between empirical error measures, the generalization error and the posterior variance.
same-paper 2 0.9582262 53 nips-2001-Constructing Distributed Representations Using Additive Clustering
Author: Wheeler Ruml
Abstract: If the promise of computational modeling is to be fully realized in higherlevel cognitive domains such as language processing, principled methods must be developed to construct the semantic representations used in such models. In this paper, we propose the use of an established formalism from mathematical psychology, additive clustering, as a means of automatically constructing binary representations for objects using only pairwise similarity data. However, existing methods for the unsupervised learning of additive clustering models do not scale well to large problems. We present a new algorithm for additive clustering, based on a novel heuristic technique for combinatorial optimization. The algorithm is simpler than previous formulations and makes fewer independence assumptions. Extensive empirical tests on both human and synthetic data suggest that it is more effective than previous methods and that it also scales better to larger problems. By making additive clustering practical, we take a significant step toward scaling connectionist models beyond hand-coded examples.
3 0.94458699 99 nips-2001-Intransitive Likelihood-Ratio Classifiers
Author: Jeff Bilmes, Gang Ji, Marina Meila
Abstract: In this work, we introduce an information-theoretic based correction term to the likelihood ratio classification method for multiple classes. Under certain conditions, the term is sufficient for optimally correcting the difference between the true and estimated likelihood ratio, and we analyze this in the Gaussian case. We find that the new correction term significantly improves the classification results when tested on medium vocabulary speech recognition tasks. Moreover, the addition of this term makes the class comparisons analogous to an intransitive game and we therefore use several tournament-like strategies to deal with this issue. We find that further small improvements are obtained by using an appropriate tournament. Lastly, we find that intransitivity appears to be a good measure of classification confidence.
4 0.92823434 101 nips-2001-K-Local Hyperplane and Convex Distance Nearest Neighbor Algorithms
Author: Pascal Vincent, Yoshua Bengio
Abstract: Guided by an initial idea of building a complex (non linear) decision surface with maximal local margin in input space, we give a possible geometrical intuition as to why K-Nearest Neighbor (KNN) algorithms often perform more poorly than SVMs on classification tasks. We then propose modified K-Nearest Neighbor algorithms to overcome the perceived problem. The approach is similar in spirit to Tangent Distance, but with invariances inferred from the local neighborhood rather than prior knowledge. Experimental results on real world classification tasks suggest that the modified KNN algorithms often give a dramatic improvement over standard KNN and perform as well or better than SVMs. 1 Motivation The notion of margin for classification tasks has been largely popularized by the success of the Support Vector Machine (SVM) [2, 15] approach. The margin of SVMs has a nice geometric interpretation1: it can be defined informally as (twice) the smallest Euclidean distance between the decision surface and the closest training point. The decision surface produced by the original SVM algorithm is the hyperplane that maximizes this distance while still correctly separating the two classes. While the notion of keeping the largest possible safety margin between the decision surface and the data points seems very reasonable and intuitively appealing, questions arise when extending the approach to building more complex, non-linear decision surfaces. Non-linear SVMs usually use the “kernel trick” to achieve their non-linearity. This conceptually corresponds to first mapping the input into a higher-dimensional feature space with some non-linear transformation and building a maximum-margin hyperplane (a linear decision surface) there. The “trick” is that this mapping is never computed directly, but implicitly induced by a kernel. In this setting, the margin being maximized is still the smallest Euclidean distance between the decision surface and the training points, but this time measured in some strange, sometimes infinite dimensional, kernel-induced feature space rather than the original input space. It is less clear whether maximizing the margin in this new space, is meaningful in general (see [16]). 1 for the purpose of this discussion, we consider the original hard-margin SVM algorithm for two linearly separable classes. A different approach is to try and build a non-linear decision surface with maximal distance to the closest data point as measured directly in input space (as proposed in [14]). We could for instance restrict ourselves to a certain class of decision functions and try to find the function with maximal margin among this class. But let us take this even further. Extending the idea of building a correctly separating non-linear decision surface as far away as possible from the data points, we define the notion of local margin as the Euclidean distance, in input space, between a given point on the decision surface and the closest training point. Now would it be possible to find an algorithm that could produce a decision surface which correctly separates the classes and such that the local margin is everywhere maximal along its surface? Surprisingly, the plain old Nearest Neighbor algorithm (1NN) [5] does precisely this! So why does 1NN in practice often perform worse than SVMs? One typical explanation, is that it has too much capacity, compared to SVM, that the class of function it can produce is too rich. 
But, considering it has infinite capacity (VC-dimension), 1NN is still performing quite well... This study is an attempt to better understand what is happening, based on geometrical intuition, and to derive an improved Nearest Neighbor algorithm from this understanding. 2 Fixing a broken Nearest Neighbor algorithm 2.1 Setting and definitions The setting is that of a classical classification problem in (the input space). wx u r i q ©
5 0.82259619 40 nips-2001-Batch Value Function Approximation via Support Vectors
Author: Thomas G. Dietterich, Xin Wang
Abstract: We present three ways of combining linear programming with the kernel trick to find value function approximations for reinforcement learning. One formulation is based on SVM regression; the second is based on the Bellman equation; and the third seeks only to ensure that good moves have an advantage over bad moves. All formulations attempt to minimize the number of support vectors while fitting the data. Experiments in a difficult, synthetic maze problem show that all three formulations give excellent performance, but the advantage formulation is much easier to train. Unlike policy gradient methods, the kernel methods described here can easily 'adjust the complexity of the function approximator to fit the complexity of the value function. 1
6 0.7652899 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines
7 0.72270012 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
8 0.69760084 69 nips-2001-Escaping the Convex Hull with Extrapolated Vector Machines
9 0.69632584 122 nips-2001-Model Based Population Tracking and Automatic Detection of Distribution Changes
10 0.68631148 175 nips-2001-Stabilizing Value Function Approximation with the BFBP Algorithm
11 0.68499023 1 nips-2001-(Not) Bounding the True Error
12 0.66246128 79 nips-2001-Gaussian Process Regression with Mismatched Models
13 0.65752929 16 nips-2001-A Parallel Mixture of SVMs for Very Large Scale Problems
14 0.64706409 20 nips-2001-A Sequence Kernel and its Application to Speaker Recognition
15 0.64187777 155 nips-2001-Quantizing Density Estimators
16 0.6404227 61 nips-2001-Distribution of Mutual Information
17 0.63562119 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning
18 0.63234568 25 nips-2001-Active Learning in the Drug Discovery Process
19 0.63224459 128 nips-2001-Multiagent Planning with Factored MDPs