jmlr jmlr2012 jmlr2012-60 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Dominik Schnitzer, Arthur Flexer, Markus Schedl, Gerhard Widmer
Abstract: ‘Hubness’ has recently been identified as a general problem of high dimensional data spaces, manifesting itself in the emergence of objects, so-called hubs, which tend to be among the k nearest neighbors of a large number of data items. As a consequence many nearest neighbor relations in the distance space are asymmetric, that is, object y is amongst the nearest neighbors of x but not vice versa. The work presented here discusses two classes of methods that try to symmetrize nearest neighbor relations and investigates to what extent they can mitigate the negative effects of hubs. We evaluate local distance scaling and propose a global variant which has the advantage of being easy to approximate for large data sets and of having a probabilistic interpretation. Both local and global approaches are shown to be effective especially for high-dimensional data sets, which are affected by high hubness. Both methods lead to a strong decrease of hubness in these data sets, while at the same time improving properties like classification accuracy. We evaluate the methods on a large number of public machine learning data sets and synthetic data. Finally we present a real-world application where we are able to achieve significantly higher retrieval quality. Keywords: local and global scaling, shared near neighbors, hubness, classification, curse of dimensionality, nearest neighbor relation
Reference: text
sentIndex sentText sentNum sentScore
1 As a consequence many nearest neighbor relations in the distance space are asymmetric, that is, object y is amongst the nearest neighbors of x but not vice versa. [sent-10, score-0.67]
2 The work presented here discusses two classes of methods that try to symmetrize nearest neighbor relations and investigates to what extent they can mitigate the negative effects of hubs. [sent-11, score-0.371]
3 Both methods lead to a strong decrease of hubness in these data sets, while at the same time improving properties like classification accuracy. [sent-14, score-0.412]
4 Keywords: local and global scaling, shared near neighbors, hubness, classification, curse of dimensionality, nearest neighbor relation 1. [sent-17, score-0.323]
5 A direct consequence of the presence of hubs is that a large number of nearest neighbor relations in the distance space are asymmetric, that is, object y is amongst the nearest neighbors of x but not vice versa. [sent-23, score-0.873]
6 A hub is by definition the nearest neighbor of a large number of objects, but [sent-24, score-0.413]
7 these objects cannot possibly all be the nearest neighbor of the hub. [sent-25, score-0.372]
8 This observation connects the hub problem to methods that attempt to symmetrize nearest neighbor relations, such as ‘shared near neighbors’ (Jarvis and Patrick, 1973) and ‘local scaling’ (Zelnik-Manor and Perona, 2005). [sent-26, score-0.431]
9 While these methods require knowledge of the local neighborhood of every data point, we propose a global variant that combines the idea of ‘shared near neighbor’ approaches with a transformation of distances to nearest neighbor ‘probabilities’ to define a concept we call Mutual Proximity. [sent-27, score-0.459]
10 To demonstrate the practical relevance, we apply our global scaling algorithm to a real-world music recommendation system and show that doing so significantly improves the retrieval quality. [sent-31, score-0.327]
11 Proper modeling of music similarity is at the heart of many applications involving the automatic organization and processing of music databases. [sent-36, score-0.345]
12 In Aucouturier and Pachet (2004), hub songs were defined as songs which are, according to an audio similarity function, similar to very many other songs and therefore keep appearing unwontedly often in recommendation lists, preventing other songs from being recommended at all. [sent-37, score-0.737]
13 The existence of the hub problem has also been reported for music recommendation based on collaborative filtering instead of audio content analysis (Celma, 2008). [sent-43, score-0.347]
14 Such points closer to the data-set mean have a high probability of being hubs, that is, of appearing in nearest neighbor lists of many other points. [sent-69, score-0.329]
15 Since hubs appear in very many nearest neighbor lists, they tend to render many nearest neighbor relations asymmetric, that is, a hub y is the nearest neighbor of x, but the nearest neighbor of the hub y is another point a (a ≠ x). [sent-75, score-1.666]
16 This is because hubs are nearest neighbors to very many data points but only k data points can be nearest neighbors to a hub since the size of a nearest neighbor list is fixed. [sent-76, score-1.096]
17 Radovanović et al. (2010) coined the term bad hubs for points that show a disagreement of class information for the majority of data points they are nearest neighbors to. [sent-80, score-0.443]
18 Figure 1 illustrates the effect: although a is, in terms of the distance measure, the correct answer to the nearest neighbor query for y, it may be beneficial to use a distance measure that enforces symmetric nearest neighbors. [sent-81, score-0.593]
19 Thus a small distance between two objects should be returned only if their nearest neighbors concur. [sent-82, score-0.378]
20 This links the hub problem to ‘shared near neighbor’ (SNN) approaches, which try to symmetrize nearest neighbor relations. [sent-83, score-0.431]
21 As the name suggests, the shared near neighbor (SNN) similarity is based on computing the overlap between the k nearest neighbors of two objects. [sent-86, score-0.438]
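To make the SNN idea concrete, here is a minimal sketch (not taken from the paper) that computes the neighbor-set overlap from a precomputed distance matrix; it assumes a symmetric matrix with a zero diagonal, and the normalization by k is one common convention rather than necessarily the exact variant evaluated here.

```python
import numpy as np

def snn_similarity(D, k):
    """Shared-near-neighbor similarity: fraction of overlap between the
    k nearest neighbor sets of every pair of objects in distance matrix D."""
    n = D.shape[0]
    # k nearest neighbors of each object; column 0 is the object itself (d = 0)
    nn = np.argsort(D, axis=1)[:, 1:k + 1]
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            S[i, j] = len(set(nn[i]) & set(nn[j])) / k
    return S
```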
22 It transforms arbitrary distances using the distance between object x and its k'th nearest neighbor (see Section 3.1). [sent-97, score-0.503]
23 Jégou et al. (2010) describe a related method called ‘contextual dissimilarity measure’. Figure 1: Schematic plot of two classes (black/white filled circles); (a) original nearest neighbor relations, (b) desired nearest neighbor relations. [sent-100, score-0.688]
24 In many classification and retrieval scenarios, (b) would be the desired nearest neighbor relation for the data set. [sent-102, score-0.378]
25 Scaling Methods: In the previous section we have seen that (i) the emergence of hubs leads to asymmetric nearest neighbor relations and that (ii) the literature already contains hints that local scaling methods seem to improve the situation. [sent-106, score-0.648]
26 Both methods are evaluated in regard to their effects on hubness in Section 4. [sent-111, score-0.431]
27 Local Scaling: Local scaling (Zelnik-Manor and Perona, 2005) transforms arbitrary distances to so-called affinities (that is, similarities) according to: LS(dx,y) = exp(−dx,y² / (σx σy)) (1), where σx denotes the distance between object x and its k'th nearest neighbor. [sent-119, score-0.408]
28 Instead of using the distance to the k’th nearest neighbor to rescale the distances, the average distance of the k nearest neighbors is used. [sent-125, score-0.681]
29 The non-iterative contextual dissimilarity measure (NICDM) transforms distances according to: NICDM(dx,y) = dx,y / √(µx µy), where µx denotes the average distance to the k nearest neighbors of object x. [sent-127, score-0.48]
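The two transforms quoted above can be sketched directly from their formulas; the following is an illustrative NumPy version (not the authors' code), assuming a symmetric distance matrix with a zero diagonal.

```python
import numpy as np

def local_scaling(D, k):
    """Local scaling affinities: LS(d_xy) = exp(-d_xy^2 / (sigma_x * sigma_y)),
    where sigma_x is the distance from x to its k-th nearest neighbor."""
    sigma = np.sort(D, axis=1)[:, k]        # column 0 is the zero self-distance
    return np.exp(-(D ** 2) / np.outer(sigma, sigma))

def nicdm(D, k):
    """NICDM distances: d_xy / sqrt(mu_x * mu_y), where mu_x is the average
    distance from x to its k nearest neighbors."""
    mu = np.sort(D, axis=1)[:, 1:k + 1].mean(axis=1)
    return D / np.sqrt(np.outer(mu, mu))
```

Note that local scaling returns an affinity (similarity), while NICDM returns a rescaled distance, matching the descriptions above.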
30 The general idea of MP is to reinterpret the original distance space so that two objects sharing similar nearest neighbors are more closely tied to each other, while two objects with dissimilar neighborhoods are repelled from each other. [sent-134, score-0.471]
31 Objects with similar nearest neighbors are tied together closely, while objects with dissimilar neighbors are repelled. [sent-140, score-0.397]
32 4% of the nearest neighbors (at k = 1) are classified correctly; after applying MP, all (100%) of the nearest neighbors can be classified correctly. [sent-153, score-0.48]
33 Under the assumption that all distances in a data set follow a certain distribution, any distance dx,y can now be reinterpreted as the probability of y being the nearest neighbor of x, given their distance dx,y and the probability distribution P(X). [sent-180, score-0.545]
34 Note that this reinterpretation naturally leads to asymmetric probabilities for a given distance, as the distance distribution estimated for x may be different from the one estimated for y; x might be the nearest neighbor of y but not vice versa. [sent-194, score-0.407]
35 This allows for a convenient way to combine the asymmetric probabilities into a single expression, expressing the probability of x being a nearest neighbor of y and vice versa. [sent-196, score-0.338]
36 Definition 2: We compute the probability that y is the nearest neighbor of x given P(X) (the pdf defined by the distances dx,i=1..n) [sent-197, score-0.407]
37 and that x is the nearest neighbor of y given P(Y) (the pdf defined by the distances dy,i=1..n). [sent-200, score-0.407]
38 (b) The shaded area shows the probability that y is the nearest neighbor of x, based on the distance dx,y and X. [sent-218, score-0.372]
39 Area I+III corresponds to the probability of x being the nearest neighbor of y, IV+III to the probability of y being a nearest neighbor of x, and III to their joint probability: II = MP(dx,y) = 1 − [(I + III) + (IV + III) − III]. [sent-224, score-0.606]
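As an illustration of how this inclusion-exclusion expression can be instantiated, the sketch below models each object's distances with a normal distribution and treats the two events as independent, so that MP(dx,y) reduces to P(X > dx,y) · P(Y > dx,y). The normality and independence assumptions, and details such as whether self-distances are excluded when estimating the parameters, are assumptions of this sketch rather than statements from the excerpt.

```python
import numpy as np
from scipy.stats import norm

def mutual_proximity(D):
    """Mutual Proximity under a per-object normal model of distances.
    Assuming independence, 1 - [(I+III) + (IV+III) - III] factorizes into
    P(X > d_xy) * P(Y > d_xy)."""
    mu = D.mean(axis=1)                     # mean distance of each object to all others
    sd = D.std(axis=1)                      # spread of each object's distances
    sf_x = norm.sf(D, loc=mu[:, None], scale=sd[:, None])   # P(X > d_xy)
    sf_y = norm.sf(D, loc=mu[None, :], scale=sd[None, :])   # P(Y > d_xy)
    mp = sf_x * sf_y                        # probability of a mutual neighborhood
    np.fill_diagonal(mp, 1.0)               # an object is maximally proximate to itself
    return 1.0 - mp                         # return a distance, not a similarity
```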
40 While local scaling methods can of course be used with fast nearest neighbor search algorithms indexing the high dimensional spaces, the complexity is far higher than randomly sampling only a constant number of distances. [sent-248, score-0.379]
41 k-Occurrence (Nk(x)): Defines the k-occurrence of object x, that is, the number of times x occurs in the k nearest neighbor lists of all other objects in the collection. [sent-276, score-0.425]
42 Hubness (Sk): We also compute the hubness of the distance space of each collection according to Radovanović et al. [sent-279, score-0.504]
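A small sketch of both measures, assuming a distance matrix with a zero diagonal; the skewness here is SciPy's standard sample skewness, which may differ in minor details from the estimator used in the paper.

```python
import numpy as np
from scipy.stats import skew

def k_occurrence(D, k=5):
    """N_k(x): how often x appears among the k nearest neighbors of the other objects."""
    nn = np.argsort(D, axis=1)[:, 1:k + 1]  # skip the zero self-distance in column 0
    return np.bincount(nn.ravel(), minlength=D.shape[0])

def hubness(D, k=5):
    """S_k: skewness of the k-occurrence distribution; large positive values
    mean a few hub objects dominate the neighbor lists."""
    return skew(k_occurrence(D, k))
```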
43 It has also been demonstrated that hubness depends on the intrinsic rather than embedding dimensionality (Radovanović et al., 2010). [sent-308, score-0.569]
44 Percentage of Symmetric Neighborhood Relations: We call a nearest neighbor relation between two points x and y ‘symmetric’ if the following holds: object x is a nearest neighbor of y if and only if y is also the nearest neighbor of x. [sent-314, score-0.936]
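One way to count this is sketched below, under the assumption that the percentage is taken over all n·k directed neighbor relations; the paper's exact normalization is not given in this excerpt.

```python
import numpy as np

def symmetric_relation_percentage(D, k=5):
    """Percentage of k-nearest-neighbor relations x -> y whose reverse y -> x also holds."""
    n = D.shape[0]
    nn = np.argsort(D, axis=1)[:, 1:k + 1]
    R = np.zeros((n, n), dtype=bool)
    R[np.repeat(np.arange(n), k), nn.ravel()] = True   # R[x, y]: y is a kNN of x
    return 100.0 * np.logical_and(R, R.T).sum() / R.sum()
```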
45 Using the intrinsic dimensionality estimate we can see that there are data sets where the data is originally represented using high-dimensional feature vectors, although the data’s intrinsic dimensionality is quite low. [sent-375, score-0.314]
46 The evaluation results in Tables 1 and 2 are sorted by the hubness Sk=5 of the original distance space (printed in bold). [sent-377, score-0.505]
47 This is in agreement with Radovanović et al. (2010) and the theory that hubness is a consequence of high dimensionality. [sent-387, score-0.412]
48 By looking at the classification rates (columns Ck=1 and Ck=5) it can also clearly be observed that the higher the hubness and intrinsic dimensionality, the more beneficial NICDM and MP are in terms of classification accuracy. [sent-388, score-0.501]
49 For data sets with high hubness (in the collections we used, a value above 1.4) [sent-389, score-0.412]
50 Whereas only three changes in accuracy (relative to the original distances) are significant for data sets with low hubness (Sk=5 ≤ 1.4, data sets 1–17), [sent-395, score-0.436]
51 34 changes in accuracy are significant for data sets with high hubness (Sk=5 > 1.4). [sent-396, score-0.412]
52 Figures 5 and 6 (left hand sides) present these results in bar plots where the y-axis shows the index of the data sets (ordered according to hubness as in Tables 1 and 2) and the bars show the increase or decrease of classification rates. [sent-399, score-0.412]
53 Another observation from the results listed in Tables 1 and 2 is that both NICDM and MP reduce the hubness of the distance space for all data sets to relatively low values. [sent-406, score-0.481]
54 The hubness Sk=5 decreases from an average value of 2. [sent-407, score-0.412]
55 Table 1: Evaluation results ordered by ascending hubness (Sk=5) of the original distance space. [sent-655, score-0.505]
56 Table 2: Evaluation results ordered by ascending hubness (Sk=5) of the original distance space. [sent-917, score-0.505]
57 The impact of MP and NICDM on the hubness per data set is plotted in Figures 5 and 6 (right hand sides). [sent-928, score-0.412]
58 It can be seen that both MP and NICDM lead to lower hubness (measured for Sk=1,5 ) compared to the original distances. [sent-929, score-0.436]
59 The effect is more pronounced for data sets having large hubness values according to the original distances. [sent-930, score-0.436]
60 A notable exception is data set 29 (‘dorothea’) where the reduction in hubness is not so pronounced. [sent-932, score-0.412]
61 The effect of NICDM on IGK is especially unclear for data with low hubness (data sets 1–17). [sent-937, score-0.412]
62 We compare the decrease/increase of classification accuracies and hubness at k = 5. [sent-966, score-0.412]
63 Looking at the results, we can see that all methods seem to perform equally in terms of reducing hubness and increasing classification accuracies. [sent-967, score-0.412]
64 We also notice that with a sample size of S = 30 the decrease in hubness is not as pronounced for MPS as for MP. [sent-982, score-0.412]
65 Figure 9: Improvements in accuracy (absolute percentage points, significant differences marked with an asterisk) and hubness evaluated with k = 5 for MP (black) and its approximative variant MPS (gray). [sent-996, score-0.441]
66 Further Evaluations and Discussion: The previous experimental results suggest that the considered distance scaling methods work well, as they tend to reduce hubness and improve the classification/retrieval accuracy. [sent-998, score-0.537]
67 Dimensionality: As we have already shown that hubs tend to occur in high-dimensional spaces, the first experiment examines the question of whether the scaling methods actually reduce the intrinsic dimensionality of the data. [sent-1012, score-0.416]
68 In order to test this hypothesis, the following simple experiment was performed: We increase the dimensions of artificial data (generated as described above) to create high hubness, and measure the intrinsic dimensionality of the data spaces before and after scaling the distances with NICDM/MP. [sent-1013, score-0.317]
69 In each iteration we measure the hubness of the data and its intrinsic dimensionality. [sent-1015, score-0.501]
70 In Figure 10a we can see that a vector space dimension as low as 30 already leads to a distance space with very high hubness (Sk=5 > 2). [sent-1017, score-0.481]
71 We can further see that NICDM/MP are able to reduce the hubness of the data spaces as expected. [sent-1018, score-0.412]
72 For verification purposes, we (i) also map the original distance space with MDS and (ii) re-compute the hubness for the new data spaces (Figure 11a). [sent-1024, score-0.505]
73 Do hubs, after scaling the distances, still remain hubs (but ‘less severely’ so), or do they stop being hubs altogether? [sent-1031, score-0.462]
74 Measuring (a) hubness of the original and scaled data, (b) the intrinsic dimensionality of the original data. [sent-1037, score-0.617]
75 We define ‘hub’ objects as objects with a k-occurrence in the nearest neighbors greater than 5k and ‘orphan’ objects as having a k-occurrence of zero (k = 5). [sent-1045, score-0.447]
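Using the k-occurrence helper sketched earlier, the two groups can be picked out directly; the 5k and zero thresholds follow the definition quoted above.

```python
import numpy as np

def hubs_and_orphans(D, k=5):
    """Split objects into hubs (k-occurrence > 5k) and orphans (k-occurrence = 0)."""
    nk = k_occurrence(D, k)                 # k_occurrence as sketched above
    hubs = np.where(nk > 5 * k)[0]
    orphans = np.where(nk == 0)[0]
    return hubs, orphans
```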
76 Looking at the figure we can confirm that for the two studied cases (hubs/orphans) a weakening of the effects can be observed: after scaling the distances, hubs do not occur as often as nearest neighbors any more, while orphans re-appear in some nearest-neighbor lists. [sent-1047, score-0.592]
77 Another observation is that in no instance of the experiment do hubs become orphans or orphans become hubs, as the measured Nk=5 values never cross for the two classes. [sent-1049, score-0.351]
78 Orphans re-appear in the nearest neighbor lists and the strength of hubs is reduced. [sent-1055, score-0.532]
79 Badness of an object x is the number of its occurrences as nearest neighbor at a given k where it appears with a different (that is, ‘bad’) class label. [sent-1062, score-0.33]
80 As this experiment only makes sense in collections with more than one class showing high hubness, we select machine learning data sets with high hubness of Sk=5 > 2 from the 30 previously used databases. [sent-1063, score-0.412]
81 Another visible effect is that orphans re-appear in the nearest neighbor lists (see previous experiment, Figure 12) with an average badness of 36. [sent-1073, score-0.477]
82 Summary of Results: Our main result is that both global (MP) and local (NICDM) scaling show very beneficial effects concerning hubness on data sets that exhibit high hubness in the original distance space. [sent-1082, score-0.412]
83 Table 3: Relative badness (BNk=5) of hub objects (Nk=5 > 5k), orphan objects (Nk=5 = 0), and all other objects. [sent-1166, score-0.353]
84 For data sets exhibiting low hubness in the original distance space, improvements are much smaller or non-existent, but there is no degradation of performance. [sent-1170, score-0.505]
85 By enforcing symmetry in the neighborhood of objects, both methods are able to naturally reduce the occurrence of hubs in nearest neighbor lists. [sent-1172, score-0.538]
86 Interestingly, at the same time as the occurrence of hubs in nearest neighbor lists decreases, hubs also lose their badness in terms of classification accuracy. [sent-1173, score-0.809]
87 The graph visualization displays an incrementally constructed nearest neighbor graph (number of nearest neighbors = 5). [sent-1201, score-0.543]
88 To compute a similarity value between two music tracks (x, y), the method linearly combines rhythmic (dr ) and musical timbre (dt ) similarities into a single general music similarity (d) value. [sent-1204, score-0.435]
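In code this combination is a one-liner; the weights below are placeholders for illustration only, since the excerpt does not state the values used in the Soundpark system.

```python
def combined_music_distance(d_r, d_t, w_r=0.5, w_t=0.5):
    """Linear combination of rhythmic (d_r) and timbre (d_t) distances into a
    single music distance d; the weights are illustrative, not the system's."""
    return w_r * d_r + w_t * d_t
```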
89 Limitations: The above algorithm for computing music similarity creates nearest neighbor lists which exhibit very high hubness. [sent-1209, score-0.525]
90 In the case of the FM4 Soundpark application, which always displays the top-5 nearest neighbors of a song, a similarity space with high hubness has an immediate negative impact on the quality of the results of the application. [sent-1210, score-0.699]
91 High hubness leads to circular recommendations and to the effect that some songs never occur in the nearest neighbor lists at all—hubs push out other objects from the k = 5 nearest neighbors. [sent-1211, score-1.085]
92 As with the machine learning data sets evaluated previously, we observe that the hubness Sk=5 (which is particularly relevant for the application) decreases from 5. [sent-1221, score-0.412]
93 The music genre labels used in this evaluation originate from the artists who uploaded their songs to the Soundpark (see Table 4 for the music genres and their distribution in the collection). [sent-1231, score-0.503]
94 The decrease of hubness produced by MPS leads to a concurrent increase in the reachability of songs in the nearest neighbor graphs. [sent-1232, score-0.869]
95 2% of all songs are reachable via k = 5 nearest neighbor recommendation lists. [sent-1235, score-0.473]
96 The decrease of skewness is clearly visible as the number of songs that are never recommended drops from about 3,000 to 1,500; thus a more even distribution of objects across the nearest neighbor lists is achieved. [sent-1251, score-0.432]
97 Considerations on the asymmetry of neighbor relations involving hub objects led us to evaluate a recent local scaling method, and to propose a new global variant named ‘mutual proximity’ (MP). [sent-1255, score-0.437]
98 In a comprehensive empirical study we showed that both scaling methods are able to reduce hubness and improve classification accuracy as well as other performance indices. [sent-1256, score-0.468]
99 Our results indicate that both global and local scaling show very beneficial effects concerning hubness on a wide range of diverse data sets. [sent-1283, score-0.507]
100 The music information retrieval evaluation exchange (2005–2007): A window into music information retrieval research. [sent-1376, score-0.448]
wordName wordTfidf (topN-words)
[('nicdm', 0.461), ('mp', 0.459), ('hubness', 0.412), ('hubs', 0.203), ('nearest', 0.152), ('neighbor', 0.151), ('music', 0.149), ('mps', 0.129), ('songs', 0.123), ('hub', 0.11), ('distances', 0.104), ('chedl', 0.098), ('chnitzer', 0.098), ('idmer', 0.098), ('lexer', 0.098), ('caling', 0.092), ('educe', 0.092), ('lobal', 0.092), ('ubs', 0.089), ('intrinsic', 0.089), ('neighbors', 0.088), ('radovanovi', 0.08), ('soundpark', 0.08), ('retrieval', 0.075), ('badness', 0.074), ('orphans', 0.074), ('objects', 0.069), ('distance', 0.069), ('igk', 0.068), ('dimensionality', 0.068), ('ocal', 0.065), ('pace', 0.065), ('scaling', 0.056), ('sk', 0.051), ('recommendation', 0.047), ('similarity', 0.047), ('doi', 0.045), ('dmle', 0.043), ('proximity', 0.042), ('ck', 0.041), ('audio', 0.041), ('uci', 0.04), ('sc', 0.038), ('concordant', 0.037), ('extrinsic', 0.037), ('flexer', 0.037), ('genres', 0.037), ('asymmetric', 0.035), ('neighborhood', 0.032), ('austrian', 0.032), ('relations', 0.031), ('artist', 0.031), ('dorothea', 0.031), ('ois', 0.031), ('orphan', 0.031), ('reachability', 0.031), ('schedl', 0.031), ('libsvm', 0.03), ('mutual', 0.029), ('percentage', 0.029), ('object', 0.027), ('gamma', 0.027), ('fran', 0.026), ('artists', 0.026), ('mirex', 0.026), ('gauss', 0.026), ('lists', 0.026), ('ballroom', 0.025), ('discordant', 0.025), ('dominik', 0.025), ('jegou', 0.025), ('schnitzer', 0.025), ('timbre', 0.025), ('original', 0.024), ('markus', 0.024), ('ismir', 0.024), ('gerhard', 0.024), ('collection', 0.023), ('classi', 0.023), ('public', 0.023), ('mcnemar', 0.021), ('dexter', 0.021), ('song', 0.021), ('cos', 0.021), ('dissimilarity', 0.02), ('contextual', 0.02), ('local', 0.02), ('effects', 0.019), ('genre', 0.019), ('fourclass', 0.018), ('gasser', 0.018), ('jarvis', 0.018), ('musical', 0.018), ('nities', 0.018), ('nmax', 0.018), ('ofai', 0.018), ('pampalk', 0.018), ('skl', 0.018), ('snn', 0.018), ('symmetrize', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000007 60 jmlr-2012-Local and Global Scaling Reduce Hubs in Space
Author: Dominik Schnitzer, Arthur Flexer, Markus Schedl, Gerhard Widmer
Abstract: ‘Hubness’ has recently been identified as a general problem of high dimensional data spaces, manifesting itself in the emergence of objects, so-called hubs, which tend to be among the k nearest neighbors of a large number of data items. As a consequence many nearest neighbor relations in the distance space are asymmetric, that is, object y is amongst the nearest neighbors of x but not vice versa. The work presented here discusses two classes of methods that try to symmetrize nearest neighbor relations and investigates to what extent they can mitigate the negative effects of hubs. We evaluate local distance scaling and propose a global variant which has the advantage of being easy to approximate for large data sets and of having a probabilistic interpretation. Both local and global approaches are shown to be effective especially for high-dimensional data sets, which are affected by high hubness. Both methods lead to a strong decrease of hubness in these data sets, while at the same time improving properties like classification accuracy. We evaluate the methods on a large number of public machine learning data sets and synthetic data. Finally we present a real-world application where we are able to achieve significantly higher retrieval quality. Keywords: local and global scaling, shared near neighbors, hubness, classification, curse of dimensionality, nearest neighbor relation
2 0.056601286 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
Author: Neil D. Lawrence
Abstract: We introduce a new perspective on spectral dimensionality reduction which views these methods as Gaussian Markov random fields (GRFs). Our unifying perspective is based on the maximum entropy principle which is in turn inspired by maximum variance unfolding. The resulting model, which we call maximum entropy unfolding (MEU) is a nonlinear generalization of principal component analysis. We relate the model to Laplacian eigenmaps and isomap. We show that parameter fitting in the locally linear embedding (LLE) is approximate maximum likelihood MEU. We introduce a variant of LLE that performs maximum likelihood exactly: Acyclic LLE (ALLE). We show that MEU and ALLE are competitive with the leading spectral approaches on a robot navigation visualization and a human motion capture data set. Finally the maximum likelihood perspective allows us to introduce a new approach to dimensionality reduction based on L1 regularization of the Gaussian random field via the graphical lasso.
3 0.05457738 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming
Author: Garvesh Raskutti, Martin J. Wainwright, Bin Yu
Abstract: Sparse additive models are families of d-variate functions with the additive decomposition f ∗ = ∑ j∈S f j∗ , where S is an unknown subset of cardinality s ≪ d. In this paper, we consider the case where each univariate component function f j∗ lies in a reproducing kernel Hilbert space (RKHS), and analyze a method for estimating the unknown function f ∗ based on kernels combined with ℓ1 -type convex regularization. Working within a high-dimensional framework that allows both the dimension d and sparsity s to increase with n, we derive convergence rates in the L2 (P) and L2 (Pn ) norms over the class Fd,s,H of sparse additive models with each univariate function f j∗ in the unit ball of a univariate RKHS with bounded kernel function. We complement our upper bounds by deriving minimax lower bounds on the L2 (P) error, thereby showing the optimality of our method. Thus, we obtain optimal minimax rates for many interesting classes of sparse additive models, including polynomials, splines, and Sobolev classes. We also show that if, in contrast to our univariate conditions, the d-variate function class is assumed to be globally bounded, then much √ faster estimation rates are possible for any sparsity s = Ω( n), showing that global boundedness is a significant restriction in the high-dimensional setting. Keywords: sparsity, kernel, non-parametric, convex, minimax
4 0.043577343 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms
Author: Chunhua Shen, Junae Kim, Lei Wang, Anton van den Hengel
Abstract: The success of many machine learning and pattern recognition methods relies heavily upon the identification of an appropriate distance metric on the input data. It is often beneficial to learn such a metric from the input training data, instead of using a default one such as the Euclidean distance. In this work, we propose a boosting-based technique, termed B OOST M ETRIC, for learning a quadratic Mahalanobis distance metric. Learning a valid Mahalanobis distance metric requires enforcing the constraint that the matrix parameter to the metric remains positive semidefinite. Semidefinite programming is often used to enforce this constraint, but does not scale well and is not easy to implement. B OOST M ETRIC is instead based on the observation that any positive semidefinite matrix can be decomposed into a linear combination of trace-one rank-one matrices. B OOST M ETRIC thus uses rank-one positive semidefinite matrices as weak learners within an efficient and scalable boosting-based learning process. The resulting methods are easy to implement, efficient, and can accommodate various types of constraints. We extend traditional boosting algorithms in that its weak learner is a positive semidefinite matrix with trace and rank being one rather than a classifier or regressor. Experiments on various data sets demonstrate that the proposed algorithms compare favorably to those state-of-the-art methods in terms of classification accuracy and running time. Keywords: Mahalanobis distance, semidefinite programming, column generation, boosting, Lagrange duality, large margin nearest neighbor
5 0.043471754 23 jmlr-2012-Breaking the Curse of Kernelization: Budgeted Stochastic Gradient Descent for Large-Scale SVM Training
Author: Zhuang Wang, Koby Crammer, Slobodan Vucetic
Abstract: Online algorithms that process one example at a time are advantageous when dealing with very large data or with data streams. Stochastic Gradient Descent (SGD) is such an algorithm and it is an attractive choice for online Support Vector Machine (SVM) training due to its simplicity and effectiveness. When equipped with kernel functions, similarly to other SVM learning algorithms, SGD is susceptible to the curse of kernelization that causes unbounded linear growth in model size and update time with data size. This may render SGD inapplicable to large data sets. We address this issue by presenting a class of Budgeted SGD (BSGD) algorithms for large-scale kernel SVM training which have constant space and constant time complexity per update. Specifically, BSGD keeps the number of support vectors bounded during training through several budget maintenance strategies. We treat the budget maintenance as a source of the gradient error, and show that the gap between the BSGD and the optimal SVM solutions depends on the model degradation due to budget maintenance. To minimize the gap, we study greedy budget maintenance methods based on removal, projection, and merging of support vectors. We propose budgeted versions of several popular online SVM algorithms that belong to the SGD family. We further derive BSGD algorithms for multi-class SVM training. Comprehensive empirical results show that BSGD achieves higher accuracy than the state-of-the-art budgeted online algorithms and comparable to non-budget algorithms, while achieving impressive computational efficiency both in time and space during training and prediction. Keywords: SVM, large-scale learning, online learning, stochastic gradient descent, kernel methods
6 0.036248375 66 jmlr-2012-Metric and Kernel Learning Using a Linear Transformation
7 0.033315904 88 jmlr-2012-PREA: Personalized Recommendation Algorithms Toolkit
8 0.031961076 12 jmlr-2012-Active Clustering of Biological Sequences
9 0.031176791 33 jmlr-2012-Distance Metric Learning with Eigenvalue Optimization
10 0.02494148 108 jmlr-2012-Sparse and Unique Nonnegative Matrix Factorization Through Data Preprocessing
11 0.022873582 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets
12 0.022495097 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine
13 0.020709464 27 jmlr-2012-Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection
14 0.01901447 95 jmlr-2012-Random Search for Hyper-Parameter Optimization
15 0.018953694 22 jmlr-2012-Bounding the Probability of Error for High Precision Optical Character Recognition
16 0.018417278 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization
17 0.01823497 17 jmlr-2012-An Active Learning Algorithm for Ranking from Pairwise Preferences with an Almost Optimal Query Complexity
18 0.017315285 4 jmlr-2012-A Kernel Two-Sample Test
19 0.017305508 99 jmlr-2012-Restricted Strong Convexity and Weighted Matrix Completion: Optimal Bounds with Noise
20 0.017299328 101 jmlr-2012-SVDFeature: A Toolkit for Feature-based Collaborative Filtering
topicId topicWeight
[(0, -0.094), (1, 0.029), (2, 0.098), (3, 0.023), (4, -0.041), (5, 0.01), (6, 0.002), (7, 0.022), (8, 0.002), (9, -0.057), (10, -0.019), (11, 0.016), (12, -0.046), (13, -0.024), (14, 0.009), (15, 0.078), (16, 0.041), (17, -0.041), (18, 0.031), (19, 0.064), (20, 0.027), (21, 0.076), (22, 0.131), (23, -0.108), (24, -0.084), (25, 0.094), (26, -0.119), (27, -0.037), (28, -0.217), (29, -0.12), (30, 0.081), (31, 0.049), (32, 0.062), (33, -0.088), (34, -0.056), (35, -0.334), (36, 0.167), (37, -0.087), (38, -0.192), (39, 0.177), (40, -0.006), (41, 0.005), (42, -0.174), (43, 0.048), (44, -0.028), (45, -0.183), (46, -0.203), (47, -0.162), (48, 0.005), (49, 0.06)]
simIndex simValue paperId paperTitle
same-paper 1 0.96424782 60 jmlr-2012-Local and Global Scaling Reduce Hubs in Space
Author: Dominik Schnitzer, Arthur Flexer, Markus Schedl, Gerhard Widmer
Abstract: ‘Hubness’ has recently been identified as a general problem of high dimensional data spaces, manifesting itself in the emergence of objects, so-called hubs, which tend to be among the k nearest neighbors of a large number of data items. As a consequence many nearest neighbor relations in the distance space are asymmetric, that is, object y is amongst the nearest neighbors of x but not vice versa. The work presented here discusses two classes of methods that try to symmetrize nearest neighbor relations and investigates to what extent they can mitigate the negative effects of hubs. We evaluate local distance scaling and propose a global variant which has the advantage of being easy to approximate for large data sets and of having a probabilistic interpretation. Both local and global approaches are shown to be effective especially for high-dimensional data sets, which are affected by high hubness. Both methods lead to a strong decrease of hubness in these data sets, while at the same time improving properties like classification accuracy. We evaluate the methods on a large number of public machine learning data sets and synthetic data. Finally we present a real-world application where we are able to achieve significantly higher retrieval quality. Keywords: local and global scaling, shared near neighbors, hubness, classification, curse of dimensionality, nearest neighbor relation
2 0.45732909 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
Author: Neil D. Lawrence
Abstract: We introduce a new perspective on spectral dimensionality reduction which views these methods as Gaussian Markov random fields (GRFs). Our unifying perspective is based on the maximum entropy principle which is in turn inspired by maximum variance unfolding. The resulting model, which we call maximum entropy unfolding (MEU) is a nonlinear generalization of principal component analysis. We relate the model to Laplacian eigenmaps and isomap. We show that parameter fitting in the locally linear embedding (LLE) is approximate maximum likelihood MEU. We introduce a variant of LLE that performs maximum likelihood exactly: Acyclic LLE (ALLE). We show that MEU and ALLE are competitive with the leading spectral approaches on a robot navigation visualization and a human motion capture data set. Finally the maximum likelihood perspective allows us to introduce a new approach to dimensionality reduction based on L1 regularization of the Gaussian random field via the graphical lasso.
3 0.33585331 23 jmlr-2012-Breaking the Curse of Kernelization: Budgeted Stochastic Gradient Descent for Large-Scale SVM Training
Author: Zhuang Wang, Koby Crammer, Slobodan Vucetic
Abstract: Online algorithms that process one example at a time are advantageous when dealing with very large data or with data streams. Stochastic Gradient Descent (SGD) is such an algorithm and it is an attractive choice for online Support Vector Machine (SVM) training due to its simplicity and effectiveness. When equipped with kernel functions, similarly to other SVM learning algorithms, SGD is susceptible to the curse of kernelization that causes unbounded linear growth in model size and update time with data size. This may render SGD inapplicable to large data sets. We address this issue by presenting a class of Budgeted SGD (BSGD) algorithms for large-scale kernel SVM training which have constant space and constant time complexity per update. Specifically, BSGD keeps the number of support vectors bounded during training through several budget maintenance strategies. We treat the budget maintenance as a source of the gradient error, and show that the gap between the BSGD and the optimal SVM solutions depends on the model degradation due to budget maintenance. To minimize the gap, we study greedy budget maintenance methods based on removal, projection, and merging of support vectors. We propose budgeted versions of several popular online SVM algorithms that belong to the SGD family. We further derive BSGD algorithms for multi-class SVM training. Comprehensive empirical results show that BSGD achieves higher accuracy than the state-of-the-art budgeted online algorithms and comparable to non-budget algorithms, while achieving impressive computational efficiency both in time and space during training and prediction. Keywords: SVM, large-scale learning, online learning, stochastic gradient descent, kernel methods
4 0.28382966 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming
Author: Garvesh Raskutti, Martin J. Wainwright, Bin Yu
Abstract: Sparse additive models are families of d-variate functions with the additive decomposition f ∗ = ∑ j∈S f j∗ , where S is an unknown subset of cardinality s ≪ d. In this paper, we consider the case where each univariate component function f j∗ lies in a reproducing kernel Hilbert space (RKHS), and analyze a method for estimating the unknown function f ∗ based on kernels combined with ℓ1 -type convex regularization. Working within a high-dimensional framework that allows both the dimension d and sparsity s to increase with n, we derive convergence rates in the L2 (P) and L2 (Pn ) norms over the class Fd,s,H of sparse additive models with each univariate function f j∗ in the unit ball of a univariate RKHS with bounded kernel function. We complement our upper bounds by deriving minimax lower bounds on the L2 (P) error, thereby showing the optimality of our method. Thus, we obtain optimal minimax rates for many interesting classes of sparse additive models, including polynomials, splines, and Sobolev classes. We also show that if, in contrast to our univariate conditions, the d-variate function class is assumed to be globally bounded, then much √ faster estimation rates are possible for any sparsity s = Ω( n), showing that global boundedness is a significant restriction in the high-dimensional setting. Keywords: sparsity, kernel, non-parametric, convex, minimax
5 0.28211248 12 jmlr-2012-Active Clustering of Biological Sequences
Author: Konstantin Voevodski, Maria-Florina Balcan, Heiko Röglin, Shang-Hua Teng, Yu Xia
Abstract: Given a point set S and an unknown metric d on S, we study the problem of efficiently partitioning S into k clusters while querying few distances between the points. In our model we assume that we have access to one versus all queries that given a point s ∈ S return the distances between s and all other points. We show that given a natural assumption about the structure of the instance, we can efficiently find an accurate clustering using only O(k) distance queries. Our algorithm uses an active selection strategy to choose a small set of points that we call landmarks, and considers only the distances between landmarks and other points to produce a clustering. We use our procedure to cluster proteins by sequence similarity. This setting nicely fits our model because we can use a fast sequence database search program to query a sequence against an entire data set. We conduct an empirical study that shows that even though we query a small fraction of the distances between the points, we produce clusterings that are close to a desired clustering given by manual classification. Keywords: clustering, active clustering, k-median, approximation algorithms, approximation stability, clustering accuracy, protein sequences
6 0.22236678 66 jmlr-2012-Metric and Kernel Learning Using a Linear Transformation
7 0.22187129 19 jmlr-2012-An Introduction to Artificial Prediction Markets for Classification
9 0.21628156 88 jmlr-2012-PREA: Personalized Recommendation Algorithms Toolkit
10 0.2154026 27 jmlr-2012-Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection
11 0.18734474 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization
12 0.18142146 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms
13 0.17113441 70 jmlr-2012-Multi-Assignment Clustering for Boolean Data
14 0.16494705 33 jmlr-2012-Distance Metric Learning with Eigenvalue Optimization
15 0.15542857 61 jmlr-2012-ML-Flex: A Flexible Toolbox for Performing Classification Analyses In Parallel
16 0.1518053 56 jmlr-2012-Learning Linear Cyclic Causal Models with Latent Variables
17 0.14048298 95 jmlr-2012-Random Search for Hyper-Parameter Optimization
18 0.13857916 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization
19 0.13824511 116 jmlr-2012-Transfer in Reinforcement Learning via Shared Features
20 0.13318112 109 jmlr-2012-Stability of Density-Based Clustering
topicId topicWeight
[(7, 0.015), (14, 0.482), (21, 0.027), (26, 0.039), (27, 0.016), (29, 0.042), (35, 0.029), (49, 0.014), (56, 0.018), (69, 0.015), (75, 0.039), (77, 0.018), (79, 0.016), (81, 0.012), (92, 0.057), (96, 0.068)]
simIndex simValue paperId paperTitle
same-paper 1 0.76998097 60 jmlr-2012-Local and Global Scaling Reduce Hubs in Space
Author: Dominik Schnitzer, Arthur Flexer, Markus Schedl, Gerhard Widmer
Abstract: ‘Hubness’ has recently been identified as a general problem of high dimensional data spaces, manifesting itself in the emergence of objects, so-called hubs, which tend to be among the k nearest neighbors of a large number of data items. As a consequence many nearest neighbor relations in the distance space are asymmetric, that is, object y is amongst the nearest neighbors of x but not vice versa. The work presented here discusses two classes of methods that try to symmetrize nearest neighbor relations and investigates to what extent they can mitigate the negative effects of hubs. We evaluate local distance scaling and propose a global variant which has the advantage of being easy to approximate for large data sets and of having a probabilistic interpretation. Both local and global approaches are shown to be effective especially for high-dimensional data sets, which are affected by high hubness. Both methods lead to a strong decrease of hubness in these data sets, while at the same time improving properties like classification accuracy. We evaluate the methods on a large number of public machine learning data sets and synthetic data. Finally we present a real-world application where we are able to achieve significantly higher retrieval quality. Keywords: local and global scaling, shared near neighbors, hubness, classification, curse of dimensionality, nearest neighbor relation
2 0.64505637 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine
Author: Hugo Larochelle, Michael Mandel, Razvan Pascanu, Yoshua Bengio
Abstract: Recent developments have demonstrated the capacity of restricted Boltzmann machines (RBM) to be powerful generative models, able to extract useful features from input data or construct deep artificial neural networks. In such settings, the RBM only yields a preprocessing or an initialization for some other model, instead of acting as a complete supervised model in its own right. In this paper, we argue that RBMs can provide a self-contained framework for developing competitive classifiers. We study the Classification RBM (ClassRBM), a variant on the RBM adapted to the classification setting. We study different strategies for training the ClassRBM and show that competitive classification performances can be reached when appropriately combining discriminative and generative training objectives. Since training according to the generative objective requires the computation of a generally intractable gradient, we also compare different approaches to estimating this gradient and address the issue of obtaining such a gradient for problems with very high dimensional inputs. Finally, we describe how to adapt the ClassRBM to two special cases of classification problems, namely semi-supervised and multitask learning. Keywords: restricted Boltzmann machine, classification, discriminative learning, generative learning
3 0.2546244 78 jmlr-2012-Nonparametric Guidance of Autoencoder Representations using Label Information
Author: Jasper Snoek, Ryan P. Adams, Hugo Larochelle
Abstract: While unsupervised learning has long been useful for density modeling, exploratory data analysis and visualization, it has become increasingly important for discovering features that will later be used for discriminative tasks. Discriminative algorithms often work best with highly-informative features; remarkably, such features can often be learned without the labels. One particularly effective way to perform such unsupervised learning has been to use autoencoder neural networks, which find latent representations that are constrained but nevertheless informative for reconstruction. However, pure unsupervised learning with autoencoders can find representations that may or may not be useful for the ultimate discriminative task. It is a continuing challenge to guide the training of an autoencoder so that it finds features which will be useful for predicting labels. Similarly, we often have a priori information regarding what statistical variation will be irrelevant to the ultimate discriminative task, and we would like to be able to use this for guidance as well. Although a typical strategy would be to include a parametric discriminative model as part of the autoencoder training, here we propose a nonparametric approach that uses a Gaussian process to guide the representation. By using a nonparametric model, we can ensure that a useful discriminative function exists for a given set of features, without explicitly instantiating it. We demonstrate the superiority of this guidance mechanism on four data sets, including a real-world application to rehabilitation research. We also show how our proposed approach can learn to explicitly ignore statistically significant covariate information that is label-irrelevant, by evaluating on the small NORB image recognition problem in which pose and lighting labels are available. Keywords: autoencoder, gaussian process, gaussian process latent variable model, representation learning, unsupervised learning
4 0.25378713 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms
Author: Chunhua Shen, Junae Kim, Lei Wang, Anton van den Hengel
Abstract: The success of many machine learning and pattern recognition methods relies heavily upon the identification of an appropriate distance metric on the input data. It is often beneficial to learn such a metric from the input training data, instead of using a default one such as the Euclidean distance. In this work, we propose a boosting-based technique, termed B OOST M ETRIC, for learning a quadratic Mahalanobis distance metric. Learning a valid Mahalanobis distance metric requires enforcing the constraint that the matrix parameter to the metric remains positive semidefinite. Semidefinite programming is often used to enforce this constraint, but does not scale well and is not easy to implement. B OOST M ETRIC is instead based on the observation that any positive semidefinite matrix can be decomposed into a linear combination of trace-one rank-one matrices. B OOST M ETRIC thus uses rank-one positive semidefinite matrices as weak learners within an efficient and scalable boosting-based learning process. The resulting methods are easy to implement, efficient, and can accommodate various types of constraints. We extend traditional boosting algorithms in that its weak learner is a positive semidefinite matrix with trace and rank being one rather than a classifier or regressor. Experiments on various data sets demonstrate that the proposed algorithms compare favorably to those state-of-the-art methods in terms of classification accuracy and running time. Keywords: Mahalanobis distance, semidefinite programming, column generation, boosting, Lagrange duality, large margin nearest neighbor
5 0.23112437 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
Author: Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, Lin Xiao
Abstract: Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed environment. We show how our method can be used to solve the closely-related distributed stochastic optimization problem, achieving an asymptotically linear speed-up over multiple processors. Finally, we demonstrate the merits of our approach on a web-scale online prediction problem. Keywords: distributed computing, online learning, stochastic optimization, regret bounds, convex optimization
6 0.23047382 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
7 0.228552 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
8 0.22843543 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
9 0.22586671 103 jmlr-2012-Sampling Methods for the Nyström Method
10 0.22558513 98 jmlr-2012-Regularized Bundle Methods for Convex and Non-Convex Risks
11 0.22386801 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
12 0.22321256 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
13 0.22220114 104 jmlr-2012-Security Analysis of Online Centroid Anomaly Detection
14 0.22186098 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models
15 0.22120495 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
16 0.22086412 80 jmlr-2012-On Ranking and Generalization Bounds
17 0.22028412 27 jmlr-2012-Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection
18 0.22027588 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality
19 0.2202598 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets
20 0.21985187 36 jmlr-2012-Efficient Methods for Robust Classification Under Uncertainty in Kernel Matrices