2015 Seventh International Conference on Knowledge and Systems Engineering

978-1-4673-8013-3/15 \$31.00 © 2015 IEEE. DOI 10.1109/KSE.2015.72

Fast Approximate Near Neighbor Algorithm by Clustering in High Dimensions

Hung Tran-The, Z8 FPT Software JSC, Hanoi, VietNam. Email: hungtt5@fsoft.com.vn
Vinh Nguyen Van, University of Engineering and Technology, Vietnam National University, Hanoi, Hanoi, VietNam. Email: vinhnv@vnu.edu.vn
Minh Hoang Anh, Z8 FPT Software JSC, Hanoi, VietNam. Email: minhha@fsoft.com.vn

Abstract: This paper addresses the $(r, 1+\epsilon)$-approximate near neighbor problem (or $(r, 1+\epsilon)$-NN), defined as follows: given a set of $n$ points in a $d$-dimensional space, a query point $q$ and a parameter $0 < \delta < 1$, build a data structure that reports a point within distance $(1+\epsilon)r$ from $q$ with probability $1 - \delta$, if there is a point in the data set within distance $r$ from $q$. We present an algorithm for this problem in the Hamming space that uses a new clustering technique, the triangle inequality and the locality sensitive hashing approach. Our algorithm achieves $O(n^{1+\frac{2\rho}{1+2\rho}} + dn)$ space and $O(f d n^{\frac{2\rho}{1+2\rho}})$ query time, where $f$ is generally a small integer and $\rho$ is a parameter of the algorithm in (STOC 1998) [1] or (STOC 2015) [2]. Our results show that we can improve the algorithms in [1], [2] when $\epsilon$ is small ($\epsilon < 1$ for [1] and $\epsilon < 0.5$ for [2]).

I. INTRODUCTION

A similarity search problem involves a collection of objects (documents, images, etc.) that are characterized by a collection of relevant features and represented as points in a high-dimensional attribute space. Given a query and a set of $n$ objects in the form of points in the $d$-dimensional space, we are required to find the nearest (most similar) objects to the query. This is called the nearest neighbor problem. This problem is of major importance to a variety of applications, such as data compression, databases and data mining, information retrieval, image and video databases, machine learning, pattern recognition, statistics and data analysis.

There are several efficient algorithms known for the case when the dimension $d$ is "low" (see [3] for an overview). However, the main issue is dealing with high-dimensional data. Despite decades of intensive effort, the current solutions suffer from either space or query time that is exponential in $d$. In fact, for large enough $d$, in theory or in practice, they often provide little improvement over a linear algorithm that compares a query to each point from the database. This phenomenon is often called "the curse of dimensionality".

In recent years, several researchers proposed methods for overcoming this state of affairs by using approximation algorithms for the problem. In that formulation, the $(1+\epsilon)$-approximate nearest neighbor algorithm is allowed to return a point whose distance from the query is at most $(1+\epsilon)$ times the distance from the query to its nearest points, where $\epsilon > 0$ is called the approximation factor. Many algorithms for the $(1+\epsilon)$-approximate nearest neighbor problem can be found in [4], [1], [5], [6], [7], [8], [9], [10], [11]. The appeal of this approach is that, in many cases, an approximate nearest neighbor is almost as good as the exact one. An efficient approximation algorithm can be used to solve the exact nearest neighbor problem, by enumerating all approximate nearest neighbors and choosing the closest point.

In [6], [1], [11], [8], the authors constructed data structures for the $(1+\epsilon)$-approximate nearest neighbor problem which avoided the curse of dimensionality. To be more specific, for any constant $\epsilon > 0$, the data structures support queries in time $O(d \log(n))$, and use space which is polynomial in $n$. Unfortunately, the exponent in the space bounds is roughly $C/\epsilon^2$ (for $\epsilon < 1$), where $C$ is a "non-negligible" constant. Thus, even for, say, $\epsilon = 1$, the space used by the data structure is large enough that the algorithm becomes impractical even for relatively small data sets. From the practical perspective, the space used by an algorithm should be as close to linear as possible. If the space bound is (say) sub-quadratic, and the approximation factor $c$ is a constant, the best existing solutions are based on locality sensitive hashing (LSH). One other attractive feature of the locality sensitive hashing approach is that it enjoys a rigorous theoretical performance guarantee even in the worst case.

Locality Sensitive Hashing: The core idea of the LSH approach is to hash items in a similarity-preserving way, i.e., it tries to store similar items in the same buckets, while keeping dissimilar items in different buckets. In [1], [12] the authors provided such locality-sensitive hash functions for the case when the points live in the binary Hamming space $\{0, 1\}^d$. The LSH approach has two steps: training and querying. In the training step, LSH first learns a hash function $h(p) = \{h_1(p), h_2(p), \ldots, h_m(p) : h_i(p) \in \{0, 1\}\}$, where $m$ is the code length. They use projections of the input point on one of the coordinates, that is, hash functions of the form $h_i(p) = p_i$ for every $i = 1, \ldots, m$. Then, LSH represents each item in the database as a hash code by the hash mapping $h(x)$, and constructs a hash table by hashing each item into the bucket indexed by its code. In the querying step, LSH first converts the query into a hash code, and then finds its nearest neighbor in the bucket indexed by the code. If the probability of collision is at least $p = 1 - r/d$ for the close points and at most $q = 1 - (1+\epsilon)r/d$ for the far points, they obtain algorithms for the $(r, 1+\epsilon)$-NN problem using $O(n^{1+\rho} + nd)$ space, $O(dn^{1+\rho})$ preprocessing time, and $O(dn^{\rho})$ query time, where $\rho = \log(1/p)/\log(1/q)$. For $r < d/\log(n)$, $\rho < \frac{1}{1+\epsilon}$.

Related Works: In a follow-up work [9], the authors introduced LSH functions that work directly in Euclidean space and result in a faster running time. The latter algorithm forms the basis of the E2LSH package for high-dimensional similarity search, which has been used in several applied scenarios. In this paper, we focus only on the Hamming space and on improving the efficiency of the locality sensitive hashing approach. The efficiency can be improved via the exponent $\rho$. The LSH algorithm has since been used in numerous applied settings, such as [13], [12], [14], [15], [16]. The work of [17] showed that LSH for Hamming space must have $\rho \geq \frac{1}{2(1+\epsilon)} - O\left(\frac{1}{1+\epsilon}\right) - o(1)$.

In a recent paper of Alexandr Andoni et al. (SODA 2014) [18], the authors proposed a version of the LSH approach that uses essentially the same LSH function families as described in [1]. However, the properties of those hash functions as well as the overall algorithm are different. This approach leads to a two-level hashing algorithm. The outer hash table partitions the data set into buckets of bounded diameter. Then, for each bucket, they build the inner hash table, which uses (after some pruning) the center of the minimum enclosing ball of the points in the bucket as a center point. This type of hashing is data-dependent, i.e., a randomized hash family that itself depends on the actual points in the dataset. Their algorithm achieves $O(n^{\rho} + d\log(n))$ query time and $O_c(n^{1+\rho} + d\log(n))$ space, where $\rho \leq \frac{7}{8(1+\epsilon)} + O\left(\frac{1}{(1+\epsilon)^{3/2}}\right) + o(1)$. In a very recent paper of the same authors (accepted at STOC 2015) [2], they continue to consider data-dependent hashing for the optimality of the exponent $\rho$. Their algorithm achieves $O(n^{1+\rho} + dn)$ space and $O(dn^{\rho})$ query time, where $\rho = \frac{1}{2(1+\epsilon)-1}$.

Contribution: We present an algorithm for the $(r, 1+\epsilon)$-NN problem in the Hamming space that uses a new clustering technique and the triangle inequality as a preprocessing stage. To the best of our knowledge, this paper is the first to present a clustering algorithm as a preprocessing step before the LSH approach in order to solve the $(r, 1+\epsilon)$-NN problem. Our algorithm achieves the following advantages:

• We give an improvement, in the case where $\epsilon$ is small, over the classic LSH algorithm in [1] and the very recent paper [2]. By choosing $K = O(n^{\frac{1}{1+2\rho}})$, we obtain an algorithm with $O(n^{1+\frac{2\rho}{1+2\rho}} + dn)$ space and $O(f d n^{\frac{2\rho}{1+2\rho}})$ query time, where $f$ is generally a small integer. See Table I. We see that $n^{\frac{2\rho}{1+2\rho}} < n^{\rho}$ if $\frac{2\rho}{1+2\rho} < \rho$. The average value of $f$ is smaller than $\frac{d}{d-2r}$; if $d > 4r$ then the average value is 1. Note that there are not many such LSH algorithms for the $(r, 1+\epsilon)$-NN problem in Hamming space. We can cite here the two algorithms in [1], [2].
• We provide a simple algorithm consisting of two procedures, indexing and querying.

| Algorithm | Query Time | Space | Comment |
|---|---|---|---|
| STOC 1998 [1] | $O(dn^{\rho})$ | $O(n^{1+\rho} + dn)$ | $\rho < \frac{1}{1+\epsilon}$ |
| STOC 2015 [2] | $O(dn^{\rho})$ | $O(n^{1+\rho} + dn)$ | $\rho = \frac{1}{2(1+\epsilon)-1}$ |
| Our results | $O(f d n^{\frac{2\rho}{1+2\rho}})$ | $O(n^{1+\frac{2\rho}{1+2\rho}} + dn)$ | $\rho$ of original algorithm |

Table I. Space and time bounds for LSH algorithms and our algorithm.

The remainder of the paper is structured as follows. Section II provides a formal definition of the problem we are tackling. Our solution to the problem is presented in Section III. Then we prove the correctness of our algorithm in Section IV and give the complexity analysis in Section V. Finally, Section VI concludes the paper.

II. PROBLEM DEFINITION

In this paper, we solve the $(r, 1+\epsilon)$-approximate near neighbor problem in the Hamming space:

Definition 1 ($(r, 1+\epsilon)$-approximate near neighbor, or $(r, 1+\epsilon)$-NN). Given a set $P$ of points in a $d$-dimensional Hamming space $H^d$, and $\delta > 0$, construct a data structure which, given any query point $q$, does the following with probability $1 - \delta$: if there exists an $r$-near neighbor of $q$ in $P$, it reports a $(1+\epsilon)r$-near neighbor of $q$ in $P$.

Similarly to [1], [12], $\delta$ is an absolute constant bounded away from 1. Formally, an $r$-near neighbor of $q$ is a point $p$ such that $d(p, q) \leq r$, where $d(p, q)$ denotes the Hamming distance between $p$ and $q$. The set $C(q, r)$ is the set of all $r$-near neighbors of $q$. Observe that $(r, 1+\epsilon)$-NN is simply a decision version of the approximate nearest neighbor problem. Although in many applications solving the decision version is good enough, one can also reduce the approximate nearest neighbor problem to the approximate near neighbor problem via a binary-search-like approach. In particular, it is known [1] that the $(1+\epsilon)$-approximate nearest neighbor problem reduces to $O(\log(n/\epsilon))$ instances of $(r, 1+\epsilon)$-NN. Then, the complexity of the $(1+\epsilon)$-approximate nearest neighbor problem is the same (within a log factor) as that of the $(r, 1+\epsilon)$-NN problem.

III. ALGORITHM

The key idea of our algorithm is that one can greatly reduce the search space by cheaply partitioning the data into clusters, which we call stage 1. Then we use the triangle inequality to select a few matching clusters and finally use an LSH algorithm for each selected cluster. We call this stage 2.

Such a space partitioning approach is very natural for search problems and is applicable in many settings. For example, [19] solves the exact K-nearest neighbors problem using K-means clustering in order to reduce the search space, and [20] solves exact K-nearest neighbors classification based on homogeneous clusters.

There are a few clustering algorithms that could be applied in our case, such as the K-means algorithm and the Canopy algorithm [21]. The key idea of K-means is to partition the data into K clusters in which each point belongs to the cluster with the nearest mean, while the key idea of the Canopy algorithm is to use a cheap, approximate distance measure to efficiently divide the data into overlapping subsets, called canopies, with the same radius. However, the disadvantages of these algorithms are that (1) we do not know how many points are in a cluster and (2) the computation for the K-means algorithm is expensive, while canopies may overlap in the Canopy algorithm. We introduce here a simple clustering algorithm that ensures two properties: (1) we know exactly how many points are in a cluster and (2) the computation for this algorithm is cheap.

A. Clustering Algorithm

Definition 2. A K-cluster with the center $c$ is a subset $S$ of $P$ such that:
1) $|S| = O(K)$,
2) all points of $S$ are $O(K)$ nearest neighbors of $c$.

The idea of K-nearest neighbor clustering is as follows. At first, we randomly choose a point $c$ of $P$ as the center of a cluster, then we compute the $O(K)$ nearest neighbors of $c$ by using the Merge sort algorithm. The Merge sort algorithm sorts the distances from the points of the dataset to $c$ in ascending order and stores them in an array. We select the $O(K)$ points that correspond to the first $O(K)$ elements of the array. These $O(K)$ points and the point $c$ form a cluster with radius $R(c)$, which is the $O(K)$-th element of the array. Hence, it is easy to see that the distances of these $O(K)$ points to $c$ are at most $R(c)$. Afterwards, we remove these points from $P$ and randomly choose another point as the center of the next cluster. The process stops when all points of $P$ have been removed and belong to a cluster. The details of the algorithm are given in Figure 1.

    Input: a parameter K and a set P = {p_1, p_2, ..., p_n}
    Output: a set of clusters C
    C = ∅
    while P ≠ ∅
        select randomly a point c from P to initialize a cluster
        search for the set N(c) of O(K) nearest neighbors of c from P
            by using the Merge sort algorithm on distances of data to c
        add the points in N(c) to the cluster c
        add the cluster c to the set C
        remove all points of the cluster c from P

Figure 1. K-Nearest Neighbor Clustering Algorithm

Figure 2 illustrates an example of the result of the K-nearest neighbor clustering algorithm. Given a set $P$ with 12 points {a, b, c, d, e, f, g, m, n, o, p, q}, we obtain 3 clusters with centers p, q, t, and each cluster contains 4 points.

Figure 2. Example of clustering algorithm

B. Searching Algorithm

In the second stage, we select clusters according to some conditions, which we call step 1, and then we use an LSH algorithm for each selected cluster in order to search for the near neighbor, which we call step 2. Since there are $O(K)$ points in each cluster, there are at most $O(\frac{n}{K})$ clusters. If, by some elimination mechanism such as branch and bound, we only select a few clusters, then the search space for the LSH algorithm is greatly reduced.

Step 1: After the clustering algorithm, we have a set of clusters $C$. For each cluster with center $c$, let $R(c)$ be the radius of the cluster. We use the triangle inequality in order to select only a few clusters among all clusters. Such a cluster is called a candidate, defined as follows:

Definition 3. Given a query point $q$, we say that a cluster with center $c$ is a candidate of $q$ if the following two conditions are satisfied:
• $d(q, c) \leq R(c) + r$,
• for each cluster with center $c'$ such that $c' \neq c$, $d(q, c') \geq R(c') - r$.

Step 2: In this step, we select an LSH algorithm ([1] or [2]) for $(r, 1+\epsilon)$-NN in order to search for the near neighbor in each cluster selected in step 1. All points in each cluster are organized according to the preprocessing algorithm of the selected algorithm. The only difference is that we use twice the number of buckets of the selected algorithm in each hash table, and $n^{2\rho}$ hash tables instead of $n^{\rho}$ hash tables. The reason is discussed in Section IV. We denote by LSHAlgorithm(q, c) the query algorithm of the selected LSH algorithm, where $q$ is the query point and $c$ is the center of a cluster. This query algorithm returns a near neighbor if it finds one in the cluster with center $c$; otherwise it returns null. See the searching algorithm in Figure 3.

    Input: a parameter K and a set of clusters C = {c}
    Candidates = ∅
    Query algorithm for a query point q:
    Step 1: for each c ∈ C:
                if c is a candidate of q then add c to Candidates
    Step 2: for each c in Candidates:
                LSHAlgorithm(q, c) and stop

Figure 3. Searching Algorithm using LSH Approach

IV. CORRECTNESS

Lemma 1. Given any query point $q$, if there exists $p \in P$ with $p \in C(q, r)$, then there exists a cluster in $C$ such that (1) $p$ belongs to the cluster and (2) this cluster is a candidate of $q$.

Proof: According to the clustering algorithm, $p$ belongs to exactly one cluster, with some center $c$. We show that this cluster is a candidate of $q$. Indeed, by the triangle inequality in Hamming space, $d(q, c) \leq d(c, p) + d(p, q)$. Moreover, as $p$ is within the cluster with center $c$, $d(c, p) \leq R(c)$. As $p \in C(q, r)$, $d(p, q) \leq r$. So we get $d(q, c) \leq R(c) + r$. Now, consider any cluster with center $c'$ other than $c$. By the triangle inequality, $d(q, c') \geq d(c', p) - d(p, q)$. As $p$ is within the cluster with center $c$, $p$ is not within the cluster with center $c'$. Then $d(c', p) \geq R(c')$. Moreover, $d(p, q) \leq r$. Hence, $d(q, c') \geq R(c') - r$. By Definition 3, the lemma holds.

Similarly as in [1], the correctness of our algorithm holds if we can ensure that, with constant probability, the following two properties hold:
• if there exists $p \in C(q, r)$, then $q$ will collide with $p$,
• the total number of collisions of $q$ with points in $P \setminus C(q, (1+\epsilon)r)$ is small.

Property 1 holds thanks to Lemma 1. Indeed, given any query point $q$, if there is a point $p$ that is an $r$-near neighbor of $q$, then $p$ must be in a candidate of $q$. Hence, from the correctness of the LSH algorithm used for this candidate cluster in stage 2, we obtain property 1.

For property 2, unlike in [1] or [2], we need to use two times the number of buckets of the original algorithm in each hash table, and $n^{2\rho}$ buckets instead of $n^{\rho}$ buckets. If we used $n^{\rho}$ buckets, then in the worst case we must consider all $O(\frac{n}{K})$ clusters, and thus, with the parameters of the original algorithm, the number of collisions of $q$ with points in $P \setminus C(q, (1+\epsilon)r)$ is a function of $O(\frac{n}{K})$. This value is large. To overcome this, we use two times the number of buckets of the original algorithm. Hence, the probability of collision of $q$ with some point $p \in P \setminus C(q, (1+\epsilon)r)$ is reduced $n$ times. Thus, the total number of collisions of $q$ with points in $P \setminus C(q, (1+\epsilon)r)$ is small even when we must examine all $O(\frac{n}{K})$ clusters.

V. COMPLEXITY ANALYSIS

Given any query point $q$, let $f$ be the number of candidates of $q$. We get:

Theorem 1. Given a parameter $K > 0$ and any LSH algorithm (with the exponent parameter $\rho$ in [1] or [2]), there exists a data structure for the $(r, 1+\epsilon)$-NN problem in the Hamming space with:
• preprocessing time $O(d(\frac{n}{K})K^{1+2\rho} + dn\log(K))$,
• space $O((\frac{n}{K})K^{1+2\rho} + dn)$,
• querying time $O(d\frac{n}{K} + f d K^{2\rho})$.

Proof: For each loop of the clustering algorithm, we need to determine the $O(K)$ nearest neighbors of a center. This requires $O(dK\log(K))$ processing time by using a sorting algorithm. As each loop determines a cluster containing $O(K)$ points of $P$, the clustering algorithm terminates after $O(\frac{n}{K})$ loops. Hence, the clustering algorithm uses $O((\frac{n}{K})dK\log(K)) = O(dn\log(K))$ processing time. The processing time for building the buckets in each cluster is $O(dK^{1+2\rho})$. Hence, the total preprocessing time is $O(d(\frac{n}{K})K^{1+2\rho} + dn\log(K))$.

For each loop of the clustering algorithm, we determine the $O(K)$ nearest neighbors of a center by using the sorting algorithm. An efficient sorting algorithm only requires $O(dK)$ space. Hence, after $O(\frac{n}{K})$ loops, the clustering algorithm uses $O((\frac{n}{K})dK) = O(dn)$ space. The space needed for storing the buckets of each cluster in the searching algorithm is $O(K^{1+2\rho} + dK)$. Hence, the total space is $O((\frac{n}{K})K^{1+2\rho} + dn)$.

The total query time consists of the time for selecting the $f$ candidate clusters from the $O(\frac{n}{K})$ clusters in step 1 and the time for searching for $(r, 1+\epsilon)$-approximate near neighbors in each cluster found in step 2. Hence, the total query time is $O(d\frac{n}{K}) + O(f d K^{2\rho})$.

If we let $K = O(n^{\frac{1}{1+2\rho}})$, then the preprocessing time is $O(n^{1+\frac{2\rho}{1+2\rho}} + dn\log(n))$, the space is $O(n^{1+\frac{2\rho}{1+2\rho}} + dn)$ and the query time is $O((f+1)dn^{\frac{2\rho}{1+2\rho}})$. So we get:

Theorem 2. There exists a data structure for the $(r, 1+\epsilon)$-NN problem in the Hamming space with:
• preprocessing time $O(n^{1+\frac{2\rho}{1+2\rho}} + dn\log(n))$,
• space $O(n^{1+\frac{2\rho}{1+2\rho}} + dn)$,
• querying time $O(f d n^{\frac{2\rho}{1+2\rho}})$.

Next, we show that the value of $f$ is small on average.

Lemma 2. Given any query point $q$ and a parameter $f \geq 2$, if $c_1, c_2, \ldots, c_f$ are candidates of $q$, then for every $1 \leq i \leq f$: $R(c_i) - r \leq d(q, c_i) \leq R(c_i) + r$.

Proof: It is direct from Lemma 1.

Lemma 3. Given any query point $q$ and a K-cluster with center $c$, the probability that $R(c) - r \leq d(q, c) \leq R(c) + r$ is smaller than $\frac{2r}{d}$.

Proof: There are two cases. If $R(c) > r$, then as $R(c) - r \leq d(q, c) \leq R(c) + r$, the distance $d(q, c)$ belongs to a range of length $2r$. If $R(c) \leq r$, then as $d(q, c) \leq R(c) + r$, we have $0 < d(q, c) \leq 2r$. Hence, $d(q, c)$ also belongs to a range of length $2r$. Thus, in both cases, the probability of finding a point $q$ such that $d(q, c)$ belongs to a range of length $2r$ in a $d$-dimensional space is smaller than $\frac{2r}{d}$.

Lemma 4. Given any query point $q$, the probability that $q$ has $f$ candidates is $(\frac{2r}{d})^f$.

Proof: It is direct from Lemma 2 and Lemma 3.

Lemma 5. Given any query point $q$, the average number of candidates $f$ of $q$ is smaller than $\frac{d}{d-2r}$.

Proof: As $1 \leq f \leq n/K$, we get
$$\bar{f} = \sum_{i=0}^{n/K} \left(\frac{2r}{d}\right)^i = \frac{1 - (\frac{2r}{d})^{n/K+1}}{1 - \frac{2r}{d}} < \frac{d}{d-2r}.$$

From Lemma 5, for $d > 4r$ the average number of candidates satisfies $f < 2$, so $f = 1$.

Now, we return to the complexity analysis of the algorithm. We see that $n^{\frac{2\rho}{1+2\rho}} < n^{\rho}$ if $\frac{2\rho}{1+2\rho} < \rho$, that is, if $\rho > \frac{1}{2}$. Then we achieve $n^{\frac{2\rho}{1+2\rho}} < n^{\rho}$ if $\epsilon < 1$ for the case of [1] and if $\epsilon < 0.5$ for the case of [2]. As proved above, the average value of $f$ is very small (if $d > 4r$ then $f = 1$). Hence, our algorithm is better than the original algorithms of [1] and [2] when $\epsilon$ is small, as mentioned, and $f$ is small.

VI. CONCLUSION

In this paper, we presented an algorithm for the $(r, 1+\epsilon)$-NN problem in Hamming space. The algorithm uses a new clustering technique, the triangle inequality and the locality sensitive hashing approach, which together make it possible to solve the $(r, 1+\epsilon)$-NN problem in high-dimensional space. We achieved $O(n^{1+\frac{2\rho}{1+2\rho}} + dn)$ space and $O(f d n^{\frac{2\rho}{1+2\rho}})$ query time, where $f$ is generally a small integer and $\rho$ is a parameter of the original algorithm. Our result is an improvement over those of [1], [2] in the case where the parameters $\epsilon$ and $f$ are small. Our algorithm can be investigated more deeply by looking into the tuning parameter, and can be applied in different applications with high-dimensional data, such as movie recommendation and speech recognition, which will be our future work.

ACKNOWLEDGEMENT

We thank our colleagues in Z8, FPT Software, in particular Henry Tu, for insightful discussions about the problem and for reviewing our paper.

REFERENCES

[1] P. Indyk and R. Motwani, "Approximate nearest neighbors: Towards removing the curse of dimensionality," in Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, ser. STOC '98. New York, NY, USA: ACM, 1998, pp. 604–613. [Online]. Available: http://doi.acm.org/10.1145/276698.276876

[2] A. Andoni and I. Razenshteyn, "Optimal data-dependent hashing for approximate near neighbors," CoRR, vol. abs/1501.01062, 2015. [Online]. Available: http://arxiv.org/abs/1501.01062

[3] H. Samet, Foundations of Multidimensional and Metric Data Structures (The Morgan Kaufmann Series in Computer Graphics and Geometric Modeling). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2005.

[4] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu, "An optimal algorithm for approximate nearest neighbor searching fixed dimensions," J. ACM, vol. 45, no. 6, pp. 891–923, Nov. 1998. [Online]. Available: http://doi.acm.org/10.1145/293347.293348

[5] J. M. Kleinberg, "Two algorithms for nearest-neighbor search in high dimensions," in Proceedings of the Twenty-ninth Annual ACM Symposium on Theory of Computing, ser. STOC '97. New York, NY, USA: ACM, 1997, pp. 599–608. [Online]. Available: http://doi.acm.org/10.1145/258533.258653

[6] E. Kushilevitz, R. Ostrovsky, and Y. Rabani, "Efficient search for approximate nearest neighbor in high dimensional spaces," in Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, ser. STOC '98. New York, NY, USA: ACM, 1998, pp. 614–623. [Online]. Available: http://doi.acm.org/10.1145/276698.276877

[7] S. Har-Peled, "A replacement for Voronoi diagrams of near linear size," in Proceedings of the 42nd IEEE Symposium on Foundations of Computer Science, ser. FOCS '01. Washington, DC, USA: IEEE Computer Society, 2001, pp. 94–. [Online]. Available: http://dl.acm.org/citation.cfm?id=874063.875592

[8] N. Ailon and B. Chazelle, "Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform," in Proceedings of the Thirty-eighth Annual ACM Symposium on Theory of Computing, ser. STOC '06. New York, NY, USA: ACM, 2006, pp. 557–563. [Online]. Available: http://doi.acm.org/10.1145/1132516.1132597

[9] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, "Locality-sensitive hashing scheme based on p-stable distributions," in Proceedings of the Twentieth Annual Symposium on Computational Geometry, ser. SCG '04. New York, NY, USA: ACM, 2004, pp. 253–262. [Online]. Available: http://doi.acm.org/10.1145/997817.997857

[10] S. Har-Peled and S. Mazumdar, "On coresets for k-means and k-median clustering," in Proceedings of the Thirty-sixth Annual ACM Symposium on Theory of Computing, ser. STOC '04. New York, NY, USA: ACM, 2004, pp. 291–300. [Online]. Available: http://doi.acm.org/10.1145/1007352.1007400

[11] A. Chakrabarti and O. Regev, "An optimal randomised cell probe lower bounds for approximate nearest neighbor searching," in Proceedings of the Symposium on Foundations of Computer Science.

[12] A. Gionis, P. Indyk, and R. Motwani, "Similarity search in high dimensions via hashing," in Proceedings of the 25th International Conference on Very Large Data Bases, ser. VLDB '99. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1999, pp. 518–529. [Online]. Available: http://dl.acm.org/citation.cfm?id=645925.671516

[13] J. Buhler, "Efficient large-scale sequence comparison by locality-sensitive hashing," Bioinformatics, vol. 17, no. 5, pp. 419–428, 2001. [Online]. Available: http://dx.doi.org/10.1093/bioinformatics/17.5.419

[14] E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J. D. Ullman, and C. Yang, "Finding interesting associations without support pruning," in ICDE, 2000, pp. 489–500. [Online]. Available: http://dx.doi.org/10.1109/ICDE.2000.839448

[15] G. Bogdan, S. Ilan, and M. Peter, "Mean shift based clustering in high dimensions: A texture classification example," in Proceedings of the Ninth IEEE International Conference on Computer Vision - Volume 2, ser. ICCV '03. Washington, DC, USA: IEEE Computer Society, 2003, pp. 456–. [Online]. Available: http://dl.acm.org/citation.cfm?id=946247.946595

[16] J. Buhler, "Provably sensitive indexing strategies for biosequence similarity search," in Proceedings of the Sixth Annual International Conference on Computational Biology, ser. RECOMB '02. New York, NY, USA: ACM, 2002, pp. 90–99. [Online]. Available: http://doi.acm.org/10.1145/565196.565208

[17] R. Motwani, A. Naor, and R. Panigrahi, "Lower bounds on locality sensitive hashing," in Proceedings of the Twenty-second Annual Symposium on Computational Geometry, ser. SCG '06. New York, NY, USA: ACM, 2006, pp. 154–157. [Online]. Available: http://doi.acm.org/10.1145/1137856.1137881

[18] A. Andoni, P. Indyk, H. L. Nguyen, and I. Razenshteyn, "Beyond locality-sensitive hashing."

[19] X. Wang, "A fast exact k-nearest neighbors algorithm for high dimensional search using k-means clustering and triangle inequality," in The 2011 International Joint Conference on Neural Networks, IJCNN 2011, San Jose, California, USA, July 31 - August 5, 2011, 2011, pp. 1293–1299. [Online]. Available: http://dx.doi.org/10.1109/IJCNN.2011.6033373

[20] S. Ougiaroglou and G. Evangelidis, "Efficient k-NN classification based on homogeneous clusters," Artif. Intell. Rev., vol. 42, no. 3, pp. 491–513, Oct. 2014. [Online]. Available: http://dx.doi.org/10.1007/s10462-013-9411-1

[21] A. McCallum, K. Nigam, and L. H. Ungar, "Efficient clustering of high-dimensional data sets with application to reference matching," in Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '00. New York, NY, USA: ACM, 2000, pp. 169–178. [Online]. Available: http://doi.acm.org/10.1145/347090.347123
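The two stages of Section III (K-nearest neighbor clustering, then candidate selection with the triangle inequality of Definition 3) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `hamming`, `knn_clusters` and `candidates` are hypothetical, points are tuples of 0/1 values, each cluster takes exactly K neighbors rather than O(K), Python's built-in sort stands in for Merge sort, and the LSH search inside each candidate cluster (step 2) is omitted.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[int, ...]                    # a binary vector in {0,1}^d
Cluster = Tuple[Point, int, List[Point]]   # (center, radius R(c), members)


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of coordinates in which a and b differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_clusters(points: List[Point], k: int, seed: int = 0) -> List[Cluster]:
    """Stage 1: greedy K-nearest neighbor clustering.

    Repeatedly picks a random center, groups it with its k nearest
    remaining points, and removes the group from the pool. The last
    cluster may hold fewer than k neighbors.
    """
    rng = random.Random(seed)
    remaining = list(points)
    clusters: List[Cluster] = []
    while remaining:
        center = remaining.pop(rng.randrange(len(remaining)))
        # Sort remaining points by distance to the center (the paper
        # uses Merge sort; any O(n log n) comparison sort works here).
        remaining.sort(key=lambda p: hamming(p, center))
        members = [center] + remaining[:k]
        radius = hamming(members[-1], center)  # distance of farthest member
        clusters.append((center, radius, members))
        remaining = remaining[k:]
    return clusters


def candidates(q: Point, clusters: List[Cluster], r: int) -> List[int]:
    """Stage 2, step 1: indices of clusters that satisfy Definition 3.

    Cluster c is kept if d(q, c) <= R(c) + r and every other cluster c'
    satisfies d(q, c') >= R(c') - r, exactly as the definition is stated.
    """
    selected = []
    for i, (c, radius, _) in enumerate(clusters):
        if hamming(q, c) > radius + r:
            continue
        others_ok = all(
            hamming(q, c2) >= radius2 - r
            for j, (c2, radius2, _) in enumerate(clusters)
            if j != i
        )
        if others_ok:
            selected.append(i)
    return selected
```

In a full implementation, each index returned by `candidates` would be handed to the query procedure of the chosen LSH scheme (LSHAlgorithm(q, c) in Figure 3). The sketch evaluates the second condition of Definition 3 in time linear in the number of clusters per cluster, so it favors readability over the $O(d\frac{n}{K})$ bound of step 1.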
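The choice of $K$ that turns Theorem 1 into Theorem 2 can be seen by balancing the two terms of the query time. The following worked derivation is a sketch that ignores constant factors and treats $f$ as a constant; the approximation $\rho \approx \frac{1}{1+\epsilon}$ for [1] is an assumption made here for illustration, since the paper only states $\rho < \frac{1}{1+\epsilon}$.

```latex
\begin{align*}
\text{Query time (Theorem 1):}\quad
  & d\,\frac{n}{K} + f\,d\,K^{2\rho} \\
\text{Balance the two terms:}\quad
  & \frac{n}{K} = K^{2\rho}
    \;\Longrightarrow\; K^{1+2\rho} = n
    \;\Longrightarrow\; K = n^{\frac{1}{1+2\rho}} \\
\text{Resulting query time:}\quad
  & \frac{n}{K} = K^{2\rho} = n^{\frac{2\rho}{1+2\rho}}
    \;\Longrightarrow\; O\!\left((f+1)\,d\,n^{\frac{2\rho}{1+2\rho}}\right) \\
\text{Resulting space:}\quad
  & \frac{n}{K}\,K^{1+2\rho} = n\,K^{2\rho}
    = n^{1+\frac{2\rho}{1+2\rho}} \\
\text{Improvement over } n^{\rho}:\quad
  & \frac{2\rho}{1+2\rho} < \rho
    \;\Longleftrightarrow\; 2 < 1 + 2\rho
    \;\Longleftrightarrow\; \rho > \tfrac{1}{2} \\
\text{For [1], } \rho \approx \tfrac{1}{1+\epsilon}:\quad
  & \tfrac{1}{1+\epsilon} > \tfrac{1}{2}
    \;\Longleftrightarrow\; \epsilon < 1 \\
\text{For [2], } \rho = \tfrac{1}{2(1+\epsilon)-1} = \tfrac{1}{1+2\epsilon}:\quad
  & \tfrac{1}{1+2\epsilon} > \tfrac{1}{2}
    \;\Longleftrightarrow\; \epsilon < 0.5
\end{align*}
```

These are the thresholds $\epsilon < 1$ and $\epsilon < 0.5$ quoted in the abstract and in Section V.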