Near Neighbor Search in High Dimensional Data (2) Locality-Sensitive Hashing (continued) LS Families and Amplification LS Families for Common Distance Measures Anand Rajaraman The Big Picture Document Shingling Minhashing The set of strings of length k that appear in the document Signatures : short integer vectors that represent the sets, and reflect their similarity Localitysensitive Hashing Candidate pairs : those pairs of signatures that we need to test for similarity. Candidate Pairs • Pick a similarity threshold s – e.g., s = 0.8. – Goal: Find documents with Jaccard similarity at least s. • Columns i and j are a candidate pair if their signatures agree in at least a fraction s of their rows • We expect documents i and j to have the same similarity as their signatures. LSH for Minhash Signatures • Big idea: hash columns of signature matrix M several times. • Arrange that (only) similar columns are likely to hash to the same bucket, with high probability • Candidate pairs are those that hash to the same bucket Partition Into Bands r rows per band b bands One signature Signature Matrix M Buckets Matrix M Columns 2 and 6 are probably identical (candidate pair) Columns 6 and 7 are surely different. r rows b bands Partition into Bands – (2) • Divide matrix M into b bands of r rows. – Create one hash table per band • For each band, hash its portion of each column to its hash table • Candidate pairs are columns that hash to the same bucket for ≥ 1 band. • Tune b and r to catch most similar pairs, but few nonsimilar pairs. Simplifying Assumption • There are enough buckets that columns are unlikely to hash to the same bucket unless they are identical in a particular band. • Hereafter, we assume that “same bucket” means “identical in that band.” • Assumption needed only to simplify analysis, not for correctness of algorithm. Example of bands • 100 min-hash signatures/document • Let’s choose choose b = 20, r = 5 – 20 bands, 5 signatures per band • Goal: find pairs of documents that are at least 80% similar. Suppose C1, C2 are 80% Similar • Probability C1, C2 identical in one particular band: (0.8)5 = 0.328. • Probability C1, C2 are not similar in any of the 20 bands: (1-0.328)20 = .00035 . – i.e., about 1/3000th of the 80%-similar column pairs are false negatives – We would find 99.965% pairs of truly similar documents Suppose C1, C2 Only 30% Similar • Probability C1, C2 identical in any one particular band: (0.2)5 = 0.00243 • Probability C1, C2 identical in ≥ 1 of 20 bands: 20 * 0.00243 = 0.0486 • In other words, approximately 4.86% pairs of docs with similarity 30% end up becoming candidate pairs – False positives LSH Involves a Tradeoff • Pick the number of minhashes, the number of bands, and the number of rows per band to balance false positives/ negatives. • Example: if we had only 15 bands of 5 rows, the number of false positives would go down, but the number of false negatives would go up. Analysis of LSH – What We Want Probability = 1 if s > t Probability of sharing a bucket No chance if s < t t Similarity s of two sets What One Band of One Row Gives You Remember: probability of equal hash-values = similarity Probability of sharing a bucket t Similarity s of two sets b bands, r rows/band • Columns C and D have similarity s • Pick any band (r rows) – Prob. that all rows in band equal = s r – Prob. that some row in band unequal = 1 - s r • Prob. that no band identical = (1 - s r)b • Prob. that at least 1 band identical = 1 - (1 - s r)b What b Bands of r Rows Gives You At least one band identical t ~ (1/b)1/r Probability of sharing a bucket t Similarity s of two sets No bands identical 1 - (1 - s r )b Some row All rows of a band of a band unequal are equal Example: b = 20; r = 5 s .2 .3 .4 .5 .6 .7 .8 1-(1-sr)b .006 .047 .186 .470 .802 .975 .9996 LSH Summary • Tune to get almost all pairs with similar signatures, but eliminate most pairs that do not have similar signatures. • Check in main memory that candidate pairs really do have similar signatures. • Optional: In another pass through data, check that the remaining candidate pairs really represent similar documents. The Big Picture Document Shingling Minhashing The set of strings of length k that appear in the document Signatures : short integer vectors that represent the sets, and reflect their similarity Localitysensitive Hashing Candidate pairs : those pairs of signatures that we need to test for similarity. Theory of LSH • We have used LSH to find similar documents – In reality, columns in large sparse matrices with high Jaccard similarity – e.g., customer/item purchase histories • Can we use LSH for other distance measures? – e.g., Euclidean distances, Cosine distance – Let’s generalize what we’ve learned! Families of Hash Functions • For min-hash signatures, we got a minhash function for each permutation of rows • An example of a family of hash functions – A (large) set of related hash functions generated by some mechanism – We should be able to effciently pick a hash function at random from such a family Locality-Sensitive (LS) Families • • Suppose we have a space S of points with a distance measure d. A family H of hash functions is said to be (d1,d2,p1,p2)-sensitive if for any x and y in S : 1. If d(x,y) < d1, then prob. over all h in H, that h(x) = h(y) is at least p1. 2. If d(x,y) > d2, then prob. over all h in H, that h(x) = h(y) is at most p2. A (d1,d2,p1,p2)-sensitive function p1 Pr[h(x) = h(y)] p2 d1 d2 d(x,y) Example: LS Family • Let S = sets, d = Jaccard distance, H is family of minhash functions for all permutations of rows • Then for any hash function h in H, Pr[h(x)=h(y)] = 1-d(x,y) • Simply restates theorem about minhashing in terms of distances rather than similarities Example: LS Family – (2) • Claim: H is a (1/3, 2/3, 2/3, 1/3)-sensitive family for S and d. If distance < 1/3 (so similarity > 2/3) Then probability that minhash values agree is > 2/3 • For Jaccard similarity, minhashing gives us a (d1,d2,(1-d1),(1-d2))-sensitive family for any d1 < d2. Amplifying a LS-Family • Can we reproduce the “S-curve” effect we saw before for any LS family? • The “bands” technique we learned for signature matrices carries over to this more general setting. • Two constructions: – AND construction like “rows in a band.” – OR construction like “many bands.” AND of Hash Functions • Given family H, construct family H’ consisting of r functions from H. • For h = [h1,…,hr] in H’, h(x)=h(y) if and only if hi(x)=hi(y) for all i. • Theorem: If H is (d1,d2,p1,p2)-sensitive, then H’ is (d1,d2,(p1)r,(p2)r)-sensitive. • Proof: Use fact that hi ’s are independent. OR of Hash Functions • Given family H, construct family H’ consisting of b functions from H. • For h = [h1,…,hb] in H’, h(x)=h(y) if and only if hi(x)=hi(y) for some i. • Theorem: If H is (d1,d2,p1,p2)-sensitive, then H’ is (d1,d2,1-(1-p1)b,1-(1-p2)b)sensitive. Composing Constructions • r-way AND construction followed by b-way OR construction – Exactly what we did with minhashing • Take points x and y s.t. Pr[h(x) = h(y)] = p – H will make (x,y) a candidate pair with prob. p • This construction will make (x,y) a candidate pair with probability 1-(1-pr)b – The S-Curve! AND-OR Composition • Example: Take H and construct H’ by the AND construction with r = 4. Then, from H’, construct H’’ by the OR construction with b = 4. Table for Function 1-(1-p4)4 p .2 .3 .4 .5 .6 .7 .8 .9 1-(1-p4)4 .0064 .0320 .0985 .2275 .4260 .6666 .8785 .9860 Example: Transforms a (.2,.8,.8,.2)-sensitive family into a (.2,.8,.8785,.0064)sensitive family. OR-AND Composition • Apply a b-way OR construction followed by an r-way AND construction • Tranforms probability p into (1-(1-p)b)r. – The same S-curve, mirrored horizontally and vertically. • Example: Take H and construct H’ by the OR construction with b = 4. Then, from H’, construct H’’ by the AND construction with r = 4. Table for Function (1-(1-p)4)4 p .1 .2 .3 .4 .5 .6 .7 .8 (1-(1-p)4)4 .0140 .1215 .3334 .5740 .7725 .9015 .9680 .9936 Example:Transforms a (.2,.8,.8,.2)-sensitive family into a (.2,.8,.9936,.1215)sensitive family. Cascading Constructions • Example: Apply the (4,4) OR-AND construction followed by the (4,4) ANDOR construction. • Transforms a (.2,.8,.8,.2)-sensitive family into a (.2,.8,.9999996,.0008715)sensitive family. • Note this family uses 256 of the original hash functions. Summary • Pick any two distances x < y • Start with a (x, y, (1-x), (1-y))-sensitive family • Apply constructions to produce (x, y, p, q)sensitive family, where p is almost 1 and q is almost 0. • The closer to 0 and 1 we get, the more hash functions must be used. LSH for Cosine Distance • Random Hypeplanes – Technique similar to minhashing • A (d1,d2,(1-d1/180),(1-d2/180))-sensitive family for any d1 and d2. Random Hyperplanes • Pick a random vector v, which determines a hash function hv with two buckets. • hv(x) = +1 if v.x > 0; = -1 if v.x < 0. • LS-family H = set of all functions derived from any vector. • Claim: For points x and y, Pr[h(x)=h(y)] = 1 – d(x,y)/180 Proof of Claim Look in the plane of x and y. v x θ Hyperplane normal to v h(x) = h(y) v Hyperplane normal to v h(x) ≠ h(y) y Prob[Red case] = θ/180 Signatures for Cosine Distance • Pick some number of random vectors, and hash your data for each vector. • The result is a signature (sketch) of +1’s and –1’s for each data point • Can be used for LSH like the minhash signatures for Jaccard distance. • Amplified using AND and OR constructions How to pick random vectors • Expensive to pick a random vector in M dimensions for large M – M random numbers • A more efficient approach – It suffices to consider only vectors v consisting of +1 and –1 components. – Why is this more efficient? LSH for Euclidean Distance • Simple idea: hash functions correspond to lines. • Partition the line into buckets of size a. • Hash each point to the bucket containing its projection onto the line. • Nearby points are always close; distant points are rarely in same bucket. Projection of Points Points at distance d Bucket width a If d << a, then the chance the points are in the same bucket is at least 1 – d /a. Randomly chosen line Projection of Points If d >> a, θ must be close to 90o for there to be any chance points go to the same bucket. Points at distance d θ d cos θ Bucket width a Randomly chosen line An LS-Family for Euclidean Distance • If points are distance d < a/2, prob. they are in same bucket ≥ 1- d/a = 1/2 • If points are distance > 2a apart, then they can be in the same bucket only if d cos θ ≤ a – cos θ ≤ ½ – 60 < θ < 90 – I.e., at most 1/3 probability. • Yields a (a/2, 2a, 1/2, 1/3)-sensitive family of hash functions for any a. • Amplify using AND-OR cascades