jmlr jmlr2012 jmlr2012-60 jmlr2012-60-reference knowledge-graph by maker-knowledge-mining

60 jmlr-2012-Local and Global Scaling Reduce Hubs in Space

Source: pdf

Author: Dominik Schnitzer, Arthur Flexer, Markus Schedl, Gerhard Widmer

Abstract: ‘Hubness’ has recently been identiﬁed as a general problem of high dimensional data spaces, manifesting itself in the emergence of objects, so-called hubs, which tend to be among the k nearest neighbors of a large number of data items. As a consequence many nearest neighbor relations in the distance space are asymmetric, that is, object y is amongst the nearest neighbors of x but not vice versa. The work presented here discusses two classes of methods that try to symmetrize nearest neighbor relations and investigates to what extent they can mitigate the negative effects of hubs. We evaluate local distance scaling and propose a global variant which has the advantage of being easy to approximate for large data sets and of having a probabilistic interpretation. Both local and global approaches are shown to be effective especially for high-dimensional data sets, which are affected by high hubness. Both methods lead to a strong decrease of hubness in these data sets, while at the same time improving properties like classiﬁcation accuracy. We evaluate the methods on a large number of public machine learning data sets and synthetic data. Finally we present a real-world application where we are able to achieve signiﬁcantly higher retrieval quality. Keywords: local and global scaling, shared near neighbors, hubness, classiﬁcation, curse of dimensionality, nearest neighbor relation

reference text

Charu Aggarwal, Alexander Hinneburg, and Daniel Keim. On the surprising behavior of distance metrics in high dimensional space. In Database Theory - ICDT 2001, Lecture Notes in Computer Science, pages 420–434. Springer Berlin/Heidelberg, 2001. doi: 10.1007/3-540-44503-X_27. Jean-Julien Aucouturier and François Pachet. Improving timbre similarity: How high is the sky? Journal of Negative Results in Speech and Audio Sciences, 1(1), 2004. Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1999. 13. The scripts can be found at http://www.ofai.at/~dominik.schnitzer/mp. 2899 S CHNITZER , F LEXER , S CHEDL AND W IDMER Richard Bellman. Adaptive Control Processes: A Guided Tour. Princeton University Press, 1961. Kristin P. Bennett, Usama Fayyad, and Dan Geiger. Density-based indexing for approximate nearest-neighbor queries. In Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining (ACM SIGKDD), KDD ’99, pages 233–243, New York, NY, USA, 1999. ACM. ISBN 1-58113-143-7. doi: 10.1145/312129.312236. Adam Berenzweig. Anchors and Hubs in Audio-based Music Similarity. PhD thesis, Columbia University, 2007. James C. Bezdek and Nihil R. Pal. Some new indexes of cluster validity. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 28(3):301 –315, 1998. ISSN 10834419. doi: 10.1109/3477.678624. Rui Cai, Chao Zhang, Lei Zhang, and Wei-Ying Ma. Scalable music recommendation by search. In Proceedings of the 15th International Conference on Multimedia, pages 1065–1074, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-702-5. doi: 10.1145/1291233.1291466. Michael Casey, Christophe Rhodes, and Malcolm Slaney. Analysis of minimum distances in highdimensional musical spaces. IEEE Transactions on Audio, Speech, and Language Processing, 16 (5):1015–1028, 2008. doi: 10.1109/TASL.2008.925883. Òscar Celma. Music Recommendation and Discovery in the Long Tail. PhD thesis, Universitat Pompeu Fabra, Barcelona, 2008. Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A Library for Support Vector Machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm. Olivier Chapelle, Patrick Haffner, and Vladimir N. Vapnik. Support vector machines for histogrambased image classiﬁcation. IEEE Transactions on Neural Networks, 10(5):1055 –1064, 1999. ISSN 1045-9227. doi: 10.1109/72.788646. Thomas G. Dietterich. Approximate statistical tests for comparing supervised classiﬁcation learning algorithms. Neural computation, 10(7):1895–1923, 1998. doi: 10.1162/089976698300017197. George Doddington, Walter Liggett, Alvin Martin, Mark Przybocki, and Douglas Reynolds. Sheep, goats, lambs and wolves a statistical analysis of speaker performance in the NIST 1998 speaker recognition evaluation. In Proceedings of the 5th International Conference on Spoken Language Processing (ICSLP-98), 1998. J. Stephen Downie. The music information retrieval evaluation exchange (2005–2007): A window into music information retrieval research. Acoustical Science and Technology, 29(4):247–255, 2008. Levent Ertöz, Michael Steinbach, and Vipin Kumar. Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data. In Proceedings of the Third SIAM International Conference on Data Mining, San Francisco, CA, 2003. Andras Ferencz, Erik G. Learned-Miller, and Jitendra Malik. Building a classiﬁcation cascade for visual identiﬁcation from one example. In Proceedings of the 10th IEEE International Conference on Computer Vision, volume 1, pages 286–293. IEEE, 2005. doi: 10.1109/ICCV.2005.52. 2900 L OCAL AND G LOBAL S CALING R EDUCE H UBS IN S PACE Arthur Flexer, Dominik Schnitzer, Martin Gasser, and Tim Pohle. Combining features reduces hubness in audio similarity. In Proceedings of the 11th International Society for Music Information Retrieval Conference (ISMIR), 2010. Damien François, Vincent Wertz, and Michel Verleysen. The concentration of fractional distances. IEEE Transactions on Knowledge and Data Engineering, 19:873–886, 2007. ISSN 1041-4347. doi: 10.1109/TKDE.2007.1037. Andrew Frank and Arthur Asuncion. UCI machine learning repository, 2010. Repository located at: http://archive.ics.uci.edu/ml. Martin Gasser and Arthur Flexer. FM4 SoundPark: Audio-based music recommendation in everyday use. In Proceedings of the 6th Sound and Music Computing Conference, pages 23–25, 2009. Simon Günter and Horst Bunke. Validation indices for graph clustering. Pattern Recognition Letters, 24(8):1107 – 1113, 2003. ISSN 0167-8655. doi: DOI:10.1016/S0167-8655(02)00257-X. Graph-based Representations in Pattern Recognition. Austin Hicklin, Craig Watson, and Brad Ulery. The Myth of Goats: How Many People have Fingerprints that are Hard to Match? US Dept. of Commerce, National Institute of Standards and Technology (NIST), 2005. R.A. Jarvis and Edward A. Patrick. Clustering using a similarity measure based on shared near neighbors. IEEE Transactions on Computers, 22:1025–1034, 1973. ISSN 0018-9340. doi: 10.1109/T-C.1973.223640. Herve Jegou, Hedi Harzallah, and Cordelia Schmid. A contextual dissimilarity measure for accurate and efﬁcient image search. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2007. doi: 10.1109/CVPR.2007.382970. Herve Jegou, Cordelia Schmid, Hedi Harzallah, and Jakob Verbeek. Accurate image search using the contextual dissimilarity measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(1), 2010. ISSN 0162-8828. doi: 10.1109/TPAMI.2008.285. Wen Jin, Anthony Tung, Jiawei Han, and Wei Wang. Ranking outliers using symmetric neighborhood relationship. In Advances in Knowledge Discovery and Data Mining, Lecture Notes in Computer Science, pages 577–593. Springer Berlin/Heidelberg, 2006. doi: 10.1007/11731139_68. Ioannis Karydis, Miloš Radovanovi´ , Alexandros Nanopoulos, and Mirjana Ivanovi´ . Looking c c through the