Lecture Notes in Computer Science 9939

Commenced publication in 1973.
Founding and former series editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen.

Editorial Board:
David Hutchison, Lancaster University, Lancaster, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Friedemann Mattern, ETH Zurich, Zurich, Switzerland
John C. Mitchell, Stanford University, Stanford, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Dortmund, Germany
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max Planck Institute for Informatics, Saarbrücken, Germany

More information about this series at http://www.springer.com/series/7409

Laurent Amsaleg, Michael E. Houle, Erich Schubert (Eds.)

Similarity Search and Applications
9th International Conference, SISAP 2016
Tokyo, Japan, October 24–26, 2016
Proceedings

Editors:
Laurent Amsaleg, CNRS-IRISA, Rennes, France
Michael E. Houle, National Institute of Informatics, Tokyo, Japan
Erich Schubert, Ludwig-Maximilians-Universität München, Munich, Germany

ISSN 0302-9743; ISSN 1611-3349 (electronic)
Lecture Notes in Computer Science
ISBN 978-3-319-46758-0; ISBN 978-3-319-46759-7 (eBook)
DOI 10.1007/978-3-319-46759-7
Library of Congress Control Number: 2016954121
LNCS Sublibrary: SL3 – Information Systems and Applications, incl. Internet/Web, and HCI

© Springer International Publishing AG 2016
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.

Printed on acid-free paper.
This Springer imprint is published by Springer Nature. The registered company is Springer International Publishing AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

Preface

This volume
contains the papers presented at the 9th International Conference on Similarity Search and Applications (SISAP 2016), held in Tokyo, Japan, during October 24–26, 2016. SISAP is an annual forum for researchers and application developers in the area of similarity data management. It addresses the technological problems shared by numerous application domains, such as data mining, information retrieval, multimedia, computer vision, pattern recognition, computational biology, geography, biometrics, machine learning, and many others that make use of similarity search as a necessary supporting service. From its roots as a regional workshop in metric indexing, SISAP has expanded to become the only international conference entirely devoted to the issues surrounding the theory, design, analysis, practice, and application of content-based and feature-based similarity search. The SISAP initiative has also created a repository (http://www.sisap.org/) serving the similarity search community for the exchange of examples of real-world applications, source code for similarity indexes, and experimental test beds and benchmark data sets.

The call for papers welcomed full papers, short papers, and demonstration papers, with all manuscripts presenting previously unpublished research contributions. At SISAP 2016, all contributions were presented both orally and in a poster session, which facilitated fruitful exchanges between the participants. We received 47 submissions, 32 full papers and 15 short papers, from authors based in 21 different countries. The Program Committee (PC) was composed of 62 members from 26 countries. Reviews were thoroughly discussed by the chairs and PC members: each submission received between three and five reviews, with additional reviews sometimes being sought in order to reach a consensus. The PC was assisted by 23 external reviewers. The final selection of papers was made by the PC chairs based on the reviews received for each submission as well as the
subsequent discussions among PC members. The final conference program consisted of 18 full papers and seven short papers, resulting in an acceptance rate of 38 % for full papers and a cumulative rate of 53 % for full and short papers. The proceedings of SISAP are published by Springer as a volume in the Lecture Notes in Computer Science (LNCS) series. For SISAP 2016, as in previous years, extended versions of five selected excellent papers were invited for publication in a special issue of the journal Information Systems. The conference also conferred a Best Paper Award, as judged by the PC co-chairs and the Steering Committee.

The conference program and the proceedings are organized in several parts. As a first part, the program includes three keynote presentations from exceptionally skilled scientists: Alexandr Andoni, from Columbia University, USA, on the topic of "Data-Dependent Hashing for Similarity Search"; Takashi Washio, from Osaka University, Japan, on "Defying the Gravity of Learning Curves: Are More Samples Better for Nearest Neighbor Anomaly Detectors?"; and Zhi-Hua Zhou, from Nanjing University, China, on "Partial Similarity Match with Multi-instance Multi-label Learning". The program then carries on with the presentations of the papers, grouped in eight categories: graphs and networks; metric and permutation-based indexing; multimedia; text and document similarity; comparisons and benchmarks; hashing techniques; time-evolving data; and scalable similarity search.

We would like to thank all the authors who submitted papers to SISAP 2016. We would also like to thank all members of the PC and the external reviewers for their effort and contribution to the conference. We want to express our gratitude to the members of the Organizing Committee for the enormous amount of work they have done. We also thank our sponsors and supporters for their generosity. All the submission, reviewing, and proceedings generation processes were carried out through the EasyChair
platform.

August 2016
Laurent Amsaleg
Michael E. Houle
Erich Schubert

Organization

Program Committee Chairs:
Laurent Amsaleg, CNRS-IRISA, France
Michael E. Houle, National Institute of Informatics, Japan

Program Committee Members:
Giuseppe Amato, ISTI-CNR, Italy
Laurent Amsaleg, CNRS-IRISA, France
Hiroki Arimura, Hokkaido University, Japan
Ira Assent, Aarhus University, Denmark
James Bailey, University of Melbourne, Australia
Christian Beecks, RWTH Aachen University, Germany
Panagiotis Bouros, Aarhus University, Denmark
Leonid Boytsov, Carnegie Mellon University, USA
Benjamin Bustos, University of Chile, Chile
K. Selçuk Candan, Arizona State University, USA
Guang-Ho Cha, Seoul National University of Science and Technology, Korea
Edgar Chávez, CICESE, Mexico
Paolo Ciaccia, University of Bologna, Italy
Richard Connor, University of Strathclyde, UK
Michel Crucianu, CNAM, France
Bin Cui, Peking University, China
Vlad Estivill-Castro, Griffith University, Australia
Andrea Esuli, ISTI-CNR, Italy
Fabrizio Falchi, ISTI-CNR, Italy
Claudio Gennaro, ISTI-CNR, Italy
Magnus Lie Hetland, NTNU, Norway
Michael E. Houle, National Institute of Informatics, Japan
Yoshiharu Ishikawa, Nagoya University, Japan
Björn Þór Jónsson, Reykjavik University, Iceland
Ata Kabán, University of Birmingham, UK
Ken-ichi Kawarabayashi, National Institute of Informatics, Japan
Daniel Keim, University of Konstanz, Germany
Yiannis Kompatsiaris, CERTH – ITI, Greece
Peer Kröger, Ludwig-Maximilians-Universität München, Germany
Guoliang Li, Tsinghua University, China
Jakub Lokoč, Charles University in Prague, Czech Republic
Rui Mao, Shenzhen University, China
Stéphane Marchand-Maillet, Viper Group – University of Geneva, Switzerland
Henning Müller, HES-SO, Switzerland
Gonzalo Navarro, University of Chile, Chile
Chong-Wah Ngo, City University of Hong Kong, SAR China
Beng Chin Ooi, National University of Singapore, Singapore
Vincent Oria, New Jersey Institute of Technology, USA
M. Tamer Özsu, University of Waterloo, Canada
Deepak P, IBM Research, India
Apostolos N. Papadopoulos, Aristotle University of Thessaloniki, Greece
Marco Patella, DEIS – University of Bologna, Italy
Oscar Pedreira, Universidade da Coruña, Spain
Miloš Radovanović, University of Novi Sad, Serbia
Kunihiko Sadakane, The University of Tokyo, Japan
Shin'ichi Satoh, National Institute of Informatics, Japan
Erich Schubert, Ludwig-Maximilians-Universität München, Germany
Tetsuo Shibuya, Human Genome Center, Institute of Medical Science, The University of Tokyo, Japan
Yasin Silva, Arizona State University, USA
Matthew Skala, IT University of Copenhagen, Denmark
John Smith, IBM T.J. Watson Research Center, USA
Nenad Tomašev, Google, UK
Agma Traina, University of São Paulo at São Carlos, Brazil
Takeaki Uno, National Institute of Informatics, Japan
Michel Verleysen, Université Catholique de Louvain, Belgium
Takashi Washio, ISIR, Osaka University, Japan
Marcel Worring, University of Amsterdam, The Netherlands
Pavel Zezula, Masaryk University, Czech Republic
De-Chuan Zhan, Nanjing University, China
Zhi-Hua Zhou, Nanjing University, China
Arthur Zimek, Ludwig-Maximilians-Universität München, Germany
Andreas Züfle, George Mason University, USA

Additional Reviewers:
Tetsuya Araki, Konstantinos Avgerinakis, Nicolas Basset, Michal Batko, Jessica Beltran, Hei Chan, Elisavet Chatzilari, Anh Dinh, Alceu Ferraz Costa, Karina Figueroa, David Novak, Ninh Pham, Nora Reyes, José Fernando Rodrigues Jr., Ubaldo Ruiz, Manos Schinas, Pascal Schweitzer, Diego Seco, Francesco Silvestri, Eleftherios Spyromitros-Xioufis, Eric S. Tellez, Xiaofei Zhang, Yue Zhu

Keynotes

Similarity Search of Sparse Histograms on GPU Architecture

Hasmik Osipyan (1,2), Jakub Lokoč (2), and Stéphane Marchand-Maillet (3)
(1) National Polytechnic University of Armenia, Yerevan, Armenia (hasmik.osipyan.external@worldline.com)
(2) SIRET Research Group, Faculty of Mathematics and Physics, Charles University in Prague, Prague, Czech Republic (lokoc@ksi.ms.mff.cuni.cz)
(3) University of Geneva, Geneva, Switzerland (stephane.marchand-maillet@unige.ch)

Abstract.
Searching for similar objects within a large-scale database is a hard problem due to the exponential increase of multimedia data. The time required to find the nearest objects to a specific query in a high-dimensional space has become a serious constraint on search algorithms. One possible solution to this problem is the utilization of massively parallel platforms such as GPU architectures. This solution is, however, very sensitive for applications working with sparse datasets: the performance of an algorithm can change completely depending on the sparsity settings of the input data. In this paper, we study four different approaches on the GPU architecture for finding the histograms most similar to given queries. The performance and efficiency of the observed methods were studied on a sparse dataset of half a million histograms. We summarize our empirical results and point out the optimal GPU strategy for sparse histograms with different sparsity settings.

Keywords: GPU · Similarity search · High-dimensional space · Sparse dataset

© Springer International Publishing AG 2016. L. Amsaleg et al. (Eds.): SISAP 2016, LNCS 9939, pp. 325–338, 2016. DOI: 10.1007/978-3-319-46759-7_25

1 Introduction

Similarity search in high-dimensional data [27] is a frequently used operation in many areas such as multimedia retrieval/exploration, machine learning, and computer vision. In order to perform similarity search, objects from a particular dataset have to be transformed into a descriptor space U with a distance function assigning a similarity score to two descriptors (the smaller the distance, the higher the similarity, and vice versa). The descriptors are often modeled as vectors (histograms) in R^m, while the similarity function between vectors o, q ∈ R^m is usually modeled by means of the Euclidean distance L2(o, q) = sqrt(Σ_{i=1}^{m} (o_i − q_i)²). One of the most popular similarity operations is the kNN query, defined for k ∈ N+, a query object x ∈ R^m, and a dataset D ⊂ R^m as: kNN(x) = {X ⊆ D; |X| = k ∧ ∀y ∈ X, z ∈ D − X : L2(x, y) ≤ L2(x, z)}. In some scenarios, the dataset D and a set of query objects Q are both available in advance, and the task is to evaluate all the kNN queries within a limited time period (e.g., online kNN classification).

For finding the nearest objects from D to a given query q_i ∈ Q, the sequential search approach can be used. Here, the distances between the query q_i and all objects in D are computed; then the distances are sorted and the nearest objects are taken. Whereas sequential search has been outperformed by various indexing/hashing techniques on classical CPU architectures [8,27,30], novel many-core GPU architectures [24] have caused a renaissance of sequential searching, as it constitutes a trivial data-parallel problem efficiently applicable on commodity hardware. Furthermore, brute-force kNN search is more robust against high dimensionality of the vectors and a high number of required nearest objects [16].

In this paper, we investigate brute-force kNN search in a dataset consisting of sparse vectors, focusing on various sparsity settings of the dataset. We assume that the query objects are collected dynamically (e.g., online malware detection in persistent web traffic [12]) and have to be processed within a short time interval, preventing additional vector-space transformation operations. Although additional optimizations considering compact representations of sparse vectors may result in more efficient performance on a CPU platform (see Algorithm 1 in Sect. 4), the same optimizations may suffer from novel GPU designs and specifics. Therefore, we revisit brute-force kNN sequential search in sparse vector datasets and confront the compact form representation with GPUs.

The rest of the paper is organized as follows. In Sect. 2, we review the literature related to similarity search algorithms on GPUs. Section 3 revises the GPU architecture fundamentals. The analysis of the similarity search algorithm along with the proposed methods
are summarized in Sect. 4. In Sect. 5, we describe our experimental setup and perform the empirical evaluation. Section 6 concludes the paper.

2 Related Work

In the literature, many approaches have been proposed to improve both exact and approximate similarity search algorithms. To obtain the best speedup, some of these methods even use multi-core CPUs or heterogeneous systems based on GPUs. In this section, we review the most recent parallel approaches to similarity search algorithms, as well as their individual steps for sparse datasets.

Matsumoto et al. [20] presented a new exact kNN search algorithm based on partial heap sort. By implementing the distance calculation on the GPU combined with fast heap sorting on the CPU, the authors achieved better performance compared to the existing methods on GPUs. In general, this performance is the result of the new heap-sort method, based on a minimal overhead threshold and compression, which outperformed even the sorting algorithms on the GPU. For approximate similarity search on GPUs, Krulis et al. [13] showed good performance for a permutation-based indexing algorithm. This algorithm, based on sequential indexing, was presented in the work of Mohammed et al. [21]. In this case, in addition to the distance calculation on the GPU [14], the post-processing steps for the obtained distances were implemented on GPUs as well; this includes the selection of the top-k nearest objects and their sorting with the bitonic sort algorithm. Another approach to approximate similarity search on GPUs was suggested by Teodoro et al. [28]: a new parallel framework, Hypercurves, was able to answer approximate kNN queries at high speed. Based on the filter-stream programming paradigm, this method divided the dataset into partitions, giving the opportunity to access them independently in a parallel manner; kNN then ran on the GPU. Several papers have described different kNN approaches on GPU architectures [7,17,26]. However, the method in [28] outperformed previous approaches by using heaps for the selection procedure of the top-k points. This dynamic partitioning, along with the kNN implementation, reduced query response times approximately 80× compared to the sequential version. In [16], the authors suggested a parallel approach to brute-force kNN for multiple queries. Here, the distance matrix calculation was based on the standard dot-product calculation on GPUs: each portion of the divided distance matrix was computed by a block of threads. Merge sort was then implemented on the GPU to sort each portion in parallel, and the final k points were obtained after merging.

Several works present GPU approaches for individual steps of similarity search algorithms as a standalone problem. Chang et al. [3] obtained 40× faster results on GPU hardware for pairwise distance calculation. The authors presented two different implementations. In the first approach, they used a 1D grid and 1D blocks, and the threads were organized into blocks of 256; in this case, shared memory was used to process one row of the data matrix and to calculate its distances to the 256 rows corresponding to the threads in the block. In the second approach, the authors used 16 × 16 threads in each block, where one thread computed one entry of the output; this led to better performance than the first implementation. Li et al. [15] suggested another way of calculating Euclidean distances on GPUs, which achieved approximately 15× speedup for a dataset comprising one million objects. The authors used a map-reduce technique to split the final distance matrix into smaller ones; the partial distance matrices were then calculated on the GPU, and the final solution was obtained after merging.

All the discussed works describe different algorithms for similarity search on parallel architectures. However, none of them solves the performance issues arising in applications working with sparse datasets [9]. Different methods have been suggested to efficiently learn a similarity measure for sparse datasets [4,18,29].
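The block-partitioned distance-matrix computations reviewed above can be sketched on the CPU with NumPy. This is only an illustrative sketch, not any of the referenced implementations: the function name, the tile size standing in for a GPU thread block, and the dot-product expansion of the squared distance are all assumptions made for the example.

```python
import numpy as np

def pairwise_dist_tiled(Q, D, tile=256):
    """Full |Q| x |D| Euclidean distance matrix, computed tile by tile.

    Each (tile x tile) sub-matrix mimics the portion that one GPU thread
    block would compute in the block-partitioned schemes described above.
    """
    out = np.empty((Q.shape[0], D.shape[0]))
    for i in range(0, Q.shape[0], tile):
        for j in range(0, D.shape[0], tile):
            q = Q[i:i + tile]  # tile of query vectors
            d = D[j:j + tile]  # tile of database vectors
            # Squared distances via ||q - d||^2 = ||q||^2 - 2 q.d + ||d||^2,
            # i.e. the dot-product formulation used by the GPU approaches.
            sq = (q * q).sum(1)[:, None] - 2.0 * (q @ d.T) + (d * d).sum(1)[None, :]
            # Clamp tiny negative values caused by floating-point cancellation.
            out[i:i + tile, j:j + tile] = np.sqrt(np.maximum(sq, 0.0))
    return out
```

Merging is implicit here because the tiles write into disjoint regions of the output; on a GPU, the same partitioning additionally allows the query tile to be staged in shared memory.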
Nevertheless, the main constraint of these methods remains the computation time. A few papers discuss the suitability of GPUs for applications with a sparse data model, but all of them concentrate on sparse matrix multiplication. In their continued work, Neelima et al. [22,23] presented different sparse matrix formats that increased application performance on the GPU. For example, in one representation they defined two data structures for the non-empty elements, one for the data itself and a second for the column and row indexes; for row-wise computation of the values, better performance is achieved by using the latency-hiding mechanism of the GPU. Another approach to sparse matrix-vector multiplication was suggested by Ashari et al. [2]: the non-empty elements of the rows were grouped into a constant number of blocks, which helped to reduce thread divergence. Liu et al. [19] provided another storage format for sparse matrices, CSR5; this format is an extension of CSR (Compressed Sparse Row) that avoids structure-dependent parameter tuning and is applicable to both regular and irregular matrices. In that paper, the standard segmented-sum algorithm for sparse matrix-vector multiplication [5] was redesigned using a prefix-sum scan.

For sparse data processing, the performance of the discussed methods strongly depends on the sparsity of the input data. Hence, implementing applications for sparse datasets on heterogeneous systems requires a careful understanding of the underlying architecture and the use of the right data format.

3 GPU Architecture

In this section, we present the basics of the GPU architecture, with particular emphasis on the aspects that are of great importance in the light of the studied problem. We focus mainly on the NVIDIA Kepler [24] architecture, as it was used in our experiments.

A GPU card is a peripheral device connected to the host system via the PCI-Express (PCIe) bus. It consists of several streaming multiprocessor units (SMPs), which share only the main memory bus and the L2 cache. GPUs employ a parallel paradigm called data parallelism, where concurrency is achieved by processing multiple data items simultaneously with the same routine (kernel). Each thread executes the kernel code but has a unique thread ID, which is used to identify the portion of the work processed by the thread. Threads are grouped together into blocks of the same size. Threads from different blocks are not allowed to communicate with each other directly, since it is not even guaranteed that any two blocks will be executed concurrently. Furthermore, the threads in a block are divided into subgroups (warps). The number of threads in a warp is usually fixed for each architecture (current NVIDIA GPUs use 32 threads per warp).

Fig. 1. Host and GPU memory organization scheme.

The other main difference from the CPU architecture is the memory organization, which is depicted in Fig. 1. The host memory is the operational memory of the computer, which cannot be accessed by the GPU. Input data first need to be transferred from the host memory (RAM) to the GPU memory (VRAM) via PCI-Express, which is rather slow (8 GB/s) when compared to the internal memory buses. The global memory can be accessed from the GPU cores, and it exhibits both high latency and limited bandwidth. The shared memory is shared among the threads within one group; it is rather small (tens of kB) but almost as fast as the GPU registers. The shared memory can play the role of a program-managed cache for the global memory, or it can be used to exchange intermediate results among the threads in a block. The private memory belongs exclusively to a single thread and corresponds to the GPU core registers; its size is very limited, so it is suitable for just a few local variables. The L2 cache is shared by all SMPs and transparently caches all access to global memory.
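The host-to-VRAM transfer mentioned above is often the first bottleneck, and a back-of-envelope estimate illustrates its cost. The dataset shape below matches the experiments reported later in the paper (475k histograms with about 1,000 non-empty bins each); the 4-byte bin ID, 4-byte value, and sustained 8 GB/s figures are illustrative assumptions, not measurements.

```python
# Rough cost of moving a compact sparse dataset from RAM to VRAM over PCIe.
num_histograms = 475_000      # database size used in the experiments
bins_per_histogram = 1_000    # non-empty bins per histogram (k_D)
bytes_per_bin = 4 + 4         # assumed 32-bit bin ID + 32-bit float value
pcie_bandwidth = 8e9          # ~8 GB/s, the PCIe figure quoted above

total_bytes = num_histograms * bins_per_histogram * bytes_per_bin
transfer_seconds = total_bytes / pcie_bandwidth
print(f"{total_bytes / 1e9:.1f} GB, {transfer_seconds:.3f} s per transfer")
# -> 3.8 GB, 0.475 s per transfer
```

Nearly half a second just to stage the data once explains why the methods below try to keep intermediate results on the GPU instead of shuttling them back and forth.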
The L1 cache is private to each SMP and caches data from the global memory selectively.

Another important issue on GPUs is branching. When the threads in a warp choose different code branches (if or while statements), all branches must be executed by all threads; each thread masks instruction execution according to its local conditions to ensure correct results, but heavily branched code does not perform well on GPUs.

Two programming techniques have been proposed to work directly with GPU hardware: the Compute Unified Device Architecture (CUDA), developed by NVIDIA [25], and the Open Computing Language (OpenCL), developed by the Khronos Group [11]. Although for some applications OpenCL can be a good alternative to CUDA [6], it has been shown that CUDA is the best choice for high-performance needs [10]. Hence, our research is based on the CUDA technique.

4 Searching for Nearest Sparse Histograms

In our work, we consider sparse vectors that can be represented in a compact form as an array of pairs [ID, value]. Given such a compact representation, the Euclidean distance evaluation can be implemented efficiently for CPU processing by considering only the non-empty bins, as presented in Algorithm 1. In this case, the final result is obtained from the sum of the distances for the values with the same IDs and the squares of the remaining values. However, when considering GPU architectures, this distance evaluation, consisting of many if-statements, may represent a new performance bottleneck, despite its lower memory requirements. Therefore, we revisit kNN sequential search in sparse vector datasets and confront the compact form representation with GPU architectures. Note that the sequential search is quite memory-intensive, so it has to be implemented in a cache-aware manner to achieve optimal performance. For a sparse dataset, the performance of the distance calculation on the GPU also depends on the internal sparsity settings of the given data. Hence, we need to consider the scope of the individual parameters of the used dataset, so that we can optimize the technical details of our implementation. We examine four possible approaches: a conditional solution (CDS) based on if-statements, a naive solution (NS) based on data division, a compressing solution (CS) based on compressing the query and object data, and finally a column-based solution (CBS) inspired by inverted files.

Algorithm 1. Distance for compact sparse histograms q_j ∈ Q and o_i ∈ D
1: d = 0, kQ = 0, kD = 0
2: while kQ < q_j.length ∧ kD < o_i.length do
3:   if q_j[kQ].binId == o_i[kD].binId then
4:     d += (q_j[kQ].value − o_i[kD].value)², kQ += 1, kD += 1
5:   else if q_j[kQ].binId < o_i[kD].binId then
6:     d += (q_j[kQ].value)², kQ += 1
7:   else
8:     d += (o_i[kD].value)², kD += 1
9:   end if
10: end while
11: while kQ < q_j.length do
12:   d += (q_j[kQ].value)², kQ += 1
13: end while
14: while kD < o_i.length do
15:   d += (o_i[kD].value)², kD += 1
16: end while
17: return √d

The CDS method is the baseline GPU implementation of Algorithm 1, where the performance suffers from the branching problems; this method shows worse performance than even the standard CPU-only solution. Hence, in the next subsections we explain only the NS, CS, and CBS methods. As baseline CPU solutions, we consider a standard solution (STS) based on if-statements and an inverted files solution (IFS) [1]. In IFS, the object database is represented by lists of the values with the same IDs, which accelerates the retrieval time of the values for each query.

4.1 Naive Solution

To solve the issues of CDS, we have adjusted Algorithm 1 to better use the architecture of the GPU. We propose to keep the if-statements on the CPU during the reading of the data and to send already arranged data to the GPU. Hence, we have two kernels on the GPU: the first is responsible for the distance calculation of two arrays containing only the bins with the same ID, while the second calculates the Euclidean norm of the array containing the remaining points from the query and data object. Then, the results of our two
kernels are merged on the CPU side. A simple data division example is shown in Fig. 2. This method is based on a two-dimensional thread organization. Shared memory (48 kB) was used to cache the query data points; the cached data are associated with the y dimension of the thread grid, while the non-cached data are addressed by the x coordinates. Different streams are used for the two kernels, and the results of each kernel are kept on the CPU while the next portion of the data is processed; this helps to avoid synchronization between the two kernels, as the results are merged in the final stage.

Fig. 2. Data arrangement for NS.

Each thread/block size was tested in different configurations, and the most efficient configuration was selected for the final results. Despite the avoidance of warp divergence in this approach, the division into two kernels may reduce the performance: depending on the number of bins with the same IDs, the time required for transferring the data between the CPU and GPU can exceed the time spent on the operation itself, thereby increasing the total computation time. Theoretically, in the worst case, when there is no bin with the same ID, the time complexity required for reading from file/data arrangement is O(kQ + kD), with another O(kQ + kD) for the Euclidean norm calculation, which leads to O(2kQ + 2kD) in total. Here, kD and kQ are the numbers of non-empty bins of the object and query, respectively. In the average case, when the number of bins with the same bin ID is p, the time complexity becomes O(2kQ + kD − p) (or O(kQ + 2kD − p) in the case where kD > kQ).

4.2 Compressing Solution

To avoid data division, the query or database objects can be changed so as to fill out the missing bin IDs with values. Hence, for each query/database object, an array of size equal to (last bin ID − first bin ID) can be created. Although this approach avoids if-statements in the final distance calculations, it is very sensitive to the sparse dataset itself: the higher the value of (last bin ID − first bin ID), the more memory is used for keeping all the corresponding values. Considering the small amount of memory available on GPU hardware, this approach limits the number of simultaneously processed histograms, which affects the final performance of the algorithm. To make the approach less sensitive to the values of the bin IDs, a new compressing solution (CS) was proposed: in the distance calculation, all if-statements are avoided by compressing the query/object data based on the bin ID information. A simple data compressing example is shown in Fig. 3.

Fig. 3. Compressing for CS.

The bins whose ID exists in at least one histogram are processed by adding values in the corresponding empty bins. In the worst-case scenario, the memory required for holding the query/object data is equal to the number of non-empty, unrepeated participants of both sides; hence, the performance is also sensitive to the internal structure of the dataset. This approach can decrease the execution time compared to the previously described methods, as more data points can be processed simultaneously. After the compression procedure, the distance calculations become completely independent, and each distance is computed by the exact same number of arithmetical operations on the GPU. The organization of the work among the threads and thread blocks remains the same as for the NS method. Theoretically, in the worst case, the time complexity for reading from file, including the data arrangement (compression), is O(kQ + kD); for the distance calculation we have O(kQ + kD), which leads to a total time complexity of O(2kQ + 2kD). In the average case, when the number of bins with the same bin ID is p, the time complexity becomes O(2kQ + 2kD − 2p).

4.3 Column-Based Solution

Given sparse query vectors, inverted files represent a popular and efficient index in document and multimedia retrieval systems. Inverted files are usually coupled with the cosine similarity; hence
only a fraction of the data files (corresponding to non-zero query bins) has to be visited to correctly answer a query Inspired by the inverted files, sparse vectors can be organized in columns such that vector ci represents i–th dimension from all database vectors (not in compact form) This database organization can be efficiently used with the Euclidean distance if the size of each database vector is precomputed or if all the vectors are normalized More specifically, let Iq , Io represent sets of non-zero bin IDs for vectors of query |q| = and data object |o| = Then, the squared Euclidean distance can Similarity Search of Sparse Histograms on GPU Architecture 333 be simply transformed to the following form: |qi − oi |2 = |qi − oi |2 = i∈Iq ∪Io i∈{1 d} |qi − oi |2 + i∈Iq i∈Io −Iq i∈Iq |qi − oi |2 + − = o2i o2i = + i∈Iq (|qi − oi |2 − o2i ) i∈Iq To follow this form, before starting distance calculations the size of each data object need to be computed (o2i ) Hence, in this approach, we have two kernels The first one is responsible for computing the size of vectors on GPU The second one computes the Euclidean distance (|qi − oi |2 ) after compressing the query and object vectors The main advantage of this approach compared with previous ones is that the Euclidean distance calculation only requires values stored in few bins with indexes from Iq Thus, in CBS, a large portion of data objects can be used for each iteration of GPU calculations However, depending on the size of the overlap Io ∩ Iq , there could be also a calculation overhead Theoretically, in the worst-case, the time complexity for vector size computation is O(kD + kQ ), for data compressing - O(kQ ) and for ∗ qi ∗ oi - O(p) when the number of bins with the same ID is p Hence, for each query and object the total time complexity is O(2kQ + kD + p) Experimental Results In this section, we present the hardware and dataset used for the experiments along with the results for different configurations 5.1 
5.1 Hardware Setup

Our experiments were conducted on a PC with an Intel Core i3-4010U CPU clocked at 1.7 GHz (two physical cores). The PC is equipped with an NVIDIA GeForce GT 740M GPU (Kepler architecture [24]) with two SMPs comprising 192 cores each (384 cores in total). The host ran Windows as the operating system and used the CUDA 5.1 framework for the GPGPU computations. The experiments were timed using the real-time clock of the operating system; each experiment was conducted 10 times, and the arithmetic average of the measured values is reported.

The results of the four approaches were measured on randomly generated sparse datasets (475k objects, 400k queries) with different sparsity settings (queries and objects have different numbers of bins with the same IDs). The dataset is loaded from a file in which each line corresponds to one histogram. Each histogram is stored in the compact form containing only its non-empty bins; the bins are separated by commas, and each bin is a pair of ID and value (ID:Value).

Fig. 4. Performance for various block sizes (k = kQ = kD = 1000, p = k/2, Q = {10k}, D = {475k})

5.2 Results

In Fig. 4, we summarize the results of our three algorithms for different block sizes in order to find the optimal GPU configuration. The CDS algorithm is not shown in Fig. 4, as it is very slow (∼11×) even compared with the worst GPU configuration (4 × 256 block size). The Euclidean norm was calculated using the cuBLAS library function cublasSnrm2, which automatically picks an optimized block size. Note that the y component of the block size represents the number of query points cached in the shared memory. The optimal performance for each algorithm is achieved when a small number of query points (4 to 8) is cached while a larger number of object points is processed by each thread block. The performance differences between the block sizes are due to the limited amount of shared memory on our GPU (48 KB); the algorithms should therefore perform better on newer generations of GPUs, which are expected to have even more shared memory per SMP. For the following experiments, the most efficient block size configuration (256 × 4) was selected.

Fig. 5. Performance for different numbers of bins with the same ID (k = kQ = kD = 1000, Q = {1}, D = {475k})

Figure 5 presents the measured times for different numbers of bins with the same ID (p), when the numbers of non-empty bins in the query and in the data objects are equal. The NS and CS algorithms are very sensitive to p, while the CBS algorithm is more stable. We explain this difference by the fact that the CBS algorithm works only with the non-empty bins of the query, whereas NS and CS rearrange the query and object data, where p plays a key role. For large p, our NS and CS algorithms give better results, as they do not require the vector size computation; hence, CBS can outperform the other solutions if the vector sizes are precomputed.

Fig. 6. Performance for different dimensions (kQ = 1000, Q = {1}, D = {475k})

In Fig. 6, we show the kernel times obtained for different numbers of non-empty bins in the objects (kD). The experiments were conducted with the number of bins with the same ID equal to kQ/2 if kQ < kD (kD/2 if kD < kQ). For small kQ/kD ratios, our three GPU algorithms give approximately the same results, outperforming the CPU baseline IFS method by approximately 2–3×. For larger kQ/kD ratios, the CBS algorithm outperforms the other GPU algorithms. This is explained by the fact that in our CBS method the objects are used only for the vector size computation, and not all of them participate in the distance calculation. We show that the IFS method is the best choice only for larger ratios, while CBS-GPU is the best in the other cases.

Fig. 7. Performance summary for different algorithms (kD = kQ = 1000, p = kQ/2, Q = {400k}, D = {475k})

To give a final overview, Fig. 7 presents the
total measured times of the baseline CPU solutions (STS, IFS) and the four GPU implementations (CDS, NS, CS, CBS) on a huge query/object dataset. The numbers of non-empty bins in the queries (kQ) and objects (kD) are equal to 1000, and the number of identical bin IDs is p = kQ/2 = 500. For all methods (CPU and GPU), we use the same postprocessing steps (quick sort, top-k selection), which together take a negligible amount of time. Finally, note that the NS, CS and CBS algorithms on the GPU are approximately 7–9× faster than the CPU baseline STS method. The CBS method itself is 2–3× faster than the CPU baseline IFS method for huge kQ, and 11–14× faster than the GPU baseline CDS implementation.

6 Conclusions

In this work, we have analyzed the performance issues of similarity search on sparse datasets. In high-dimensional spaces, the most time-consuming operation is the distance calculation; for sparse datasets, it becomes even more expensive due to its conditional structure. We have studied four hybrid approaches on the GPU and determined the optimal (fastest) solution for different sparsity settings. We showed that the NS, CS and CBS approaches on the GPU significantly outperform the CDS-GPU and CPU-only STS baseline solutions and show promising potential for future scaling. The experiments showed that, for huge query dimensions, our CBS method is faster even than the IFS-CPU method. In addition, the internal structure of the sparse dataset plays a key role in the final performance: depending on the number of bins with the same ID, the CBS solution can be a better choice than the NS and CS solutions, and vice versa. We finally note that the CBS solution is the best choice if the vector sizes are precomputed.

As future work, we plan to use frequent bin IDs for the distance calculations: we will track the occurrence of each bin ID and use it for later queries. After thousands of queries, such a learning system would be able to find similar objects faster, as only the frequently asked bins would be processed on the GPU. In addition, we are going to implement the IFS method on the GPU and use it as a baseline solution, and we will also consider other potential distance functions and evaluate their effect on the solutions discussed.

Acknowledgments. This paper was supported by the Czech Science Foundation project 15-08916S and by the project SVV-2016-260331, and in relation to the SNF (Swiss National Science Foundation) project MAAYA (grant number 144238).

References

1. Amato, G., Savino, P.: Approximate similarity search in metric spaces using inverted files. In: Proceedings of the 3rd International Conference on Scalable Information Systems, pp. 28:1–28:10 (2008)
2. Ashari, A., Sedaghati, N., Eisenlohr, J., Sadayappan, P.: An efficient two-dimensional blocking strategy for sparse matrix-vector multiplication on GPUs. In: ICS 2014, Muenchen, Germany, 10–13 June 2014, pp. 273–282 (2014)
3. Chang, D., Jones, N.A., Li, D., Ouyang, M., Ragade, R.K.: Compute pairwise Euclidean distances of data points with GPUs. In: Proceedings of the IASTED International Symposium on CBB, Florida, USA, 16–18 November 2008, pp. 278–283 (2008)
4. Cui, B., Zhao, J., Cong, G.: ISIS: a new approach for efficient similarity search in sparse databases. In: Kitagawa, H., Ishikawa, Y., Li, Q., Watanabe, C. (eds.)
DASFAA 2010. LNCS, vol. 5982, pp. 231–245. Springer, Heidelberg (2010)
5. Dotsenko, Y., Govindaraju, N.K., Sloan, P.J., Boyd, C., Manferdelli, J.: Fast scan algorithms on graphics processors. In: Proceedings of the 22nd Annual ICS, Island of Kos, Greece, 7–12 June 2008, pp. 205–213 (2008)
6. Fang, J., Varbanescu, A.L., Sips, H.J.: A comprehensive performance comparison of CUDA and OpenCL. In: ICPP, Taipei, Taiwan, September 2011, pp. 216–225 (2011)
7. Garcia, V., Debreuve, E., Barlaud, M.: Fast k nearest neighbor search using GPU. In: IEEE Conference on CVPR, Anchorage, USA, 23–28 June 2008, pp. 1–6 (2008)
8. Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hashing. In: Proceedings of the 25th International Conference on VLDB 1999, pp. 518–529. Morgan Kaufmann Publishers Inc., San Francisco (1999)
9. Goumas, G.I., Kourtis, K., Anastopoulos, N., Karakasis, V., Koziris, N.: Understanding the performance of sparse matrix-vector multiplication. In: 16th Euromicro International Conference on PDP, pp. 283–292 (2008)
10. Karimi, K., Dickson, N.G., Hamze, F.: A performance comparison of CUDA and OpenCL. CoRR abs/1005.2581 (2010)
11. Khronos OpenCL Working Group: The OpenCL Specification, version 1.0.29, December 2008
12. Kohout, J., Pevny, T.: Unsupervised detection of malware in persistent web traffic. In: IEEE International Conference on ICASSP (2015)
13. Kruliš, M., Osipyan, H., Marchand-Maillet, S.: Optimizing sorting and top-k selection steps in permutation based indexing on GPUs. In: Morzy, T., Valduriez, P., Bellatreche, L. (eds.) ADBIS 2015. CCIS, vol. 539, pp. 305–317. Springer, Heidelberg (2015)
14. Kruliš, M., Osipyan, H., Marchand-Maillet, S.: Permutation based indexing for high dimensional data on GPU architectures. In: 13th International Workshop on CBMI, Prague, Czech Republic, 10–12 June 2015, pp. 1–6 (2015)
15. Li, Q., Kecman, V., Salman, R.: A chunking method for Euclidean distance matrix calculation on large dataset using multi-GPU. In: The Ninth ICMLA, Washington, DC, USA, 12–14 December 2010, pp. 208–213 (2010)
16. Li, S., Amenta, N.: Brute-force k-nearest neighbors search on the GPU. In: Amato, G., Connor, R., Falchi, F., Gennaro, C. (eds.) SISAP 2015. LNCS, vol. 9371, pp. 259–270. Springer, Heidelberg (2015). doi:10.1007/978-3-319-25087-8_25
17. Liang, S., Liu, Y., Wang, C., Jian, L.: A CUDA-based parallel implementation of k-nearest neighbor algorithm. In: Cyber-Enabled Distributed Computing and Knowledge Discovery, pp. 291–296 (2010)
18. Liu, K., Bellet, A., Sha, F.: Similarity learning for high-dimensional sparse data. In: Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, AISTATS, San Diego, California, USA, 9–12 May 2015 (2015)
19. Liu, W., Vinter, B.: CSR5: an efficient storage format for cross-platform sparse matrix-vector multiplication. In: Proceedings of the 29th ACM ICS 2015, Newport Beach/Irvine, CA, USA, 8–11 June 2015, pp. 339–350 (2015)
20. Matsumoto, T., Yiu, M.L.: Accelerating exact similarity search on CPU-GPU systems. In: ICDM, Atlantic City, NJ, USA, 14–17 November 2015, pp. 320–329 (2015)
21. Mohamed, H., Osipyan, H., Marchand-Maillet, S.: Multi-core (CPU and GPU) for permutation-based indexing. In: Traina, A.J.M., Traina Jr., C., Cordeiro, R.L.F. (eds.) SISAP 2014. LNCS, vol. 8821, pp. 277–288. Springer, Heidelberg (2014)
22. Neelima, B., Raghavendra, P.S.: CSPR: column only sparse matrix representation for performance improvement on GPU architecture. In: Advances in Parallel Distributed Computing, Tirunelveli, India, 23–25 September 2011, pp. 581–595 (2011)
23. Neelima, B., Reddy, G.R.M., Raghavendra, P.S.: A GPU framework for sparse matrix vector multiplication. In: IEEE 13th International Symposium on Parallel and Distributed Computing, ISPDC, Marseille, France, June 2014, pp. 51–58 (2014)
24. NVIDIA Corporation: Kepler GPU architecture. http://www.nvidia.com/object/nvidia-kepler.html
25. NVIDIA Corporation: NVIDIA CUDA C programming guide, version 3.2 (2010)
26. Pan, J., Manocha, D.: Fast GPU-based locality sensitive hashing for k-nearest neighbor computation. In: 19th ACM SIGSPATIAL International Symposium on Advances in Geographic Information Systems, Chicago, IL, USA, pp. 211–220 (2011)
27. Samet, H.: Foundations of Multidimensional and Metric Data Structures. The Morgan Kaufmann Series in Computer Graphics and Geometric Modeling. Morgan Kaufmann Publishers Inc., San Francisco (2005)
28. Teodoro, G., Valle, E., Mariano, N., da Silva Torres, R., Meira Jr., W., Saltz, J.H.: Approximate similarity search for online multimedia services on distributed CPU-GPU platforms. VLDB J. 23(3), 427–448 (2014)
29. Wang, C., Wang, X.S.: Indexing very high-dimensional sparse and quasi-sparse vectors for similarity searches. VLDB J. 9(4), 344–361 (2001)
30. Zezula, P., Amato, G., Dohnal, V., Batko, M.: Similarity Search: The Metric Space Approach, 1st edn. Springer, New York (2010)