Near Neighbor Search in
High Dimensional Data (2)
Locality-Sensitive Hashing (continued)
LS Families and Amplification
LS Families for Common Distance
Measures

Anand Rajaraman

The Big Picture

Document

Shingling

Minhashing

The set
of strings
of length k
that appear
in the document

Signatures :
short integer
vectors that
represent the
sets, and
reflect their
similarity

Localitysensitive
Hashing

Candidate
pairs :
those pairs
of signatures
that we need
to test for
similarity.

Candidate Pairs
• Pick a similarity threshold s
– e.g., s = 0.8.
– Goal: Find documents with Jaccard similarity
at least s.

• Columns i and j are a candidate pair if
their signatures agree in at least a fraction
s of their rows
• We expect documents i and j to have the
same similarity as their signatures.

LSH for Minhash Signatures
• Big idea: hash columns of signature matrix
M several times.
• Arrange that (only) similar columns are
likely to hash to the same bucket, with
high probability
• Candidate pairs are those that hash to the
same bucket

Partition Into Bands

r rows
per band
b bands

One
signature

Signature Matrix M

Buckets

Matrix M

Columns 2 and 6
are probably identical
(candidate pair)
Columns 6 and 7 are
surely different.

r rows

b bands

Partition into Bands – (2)
• Divide matrix M into b bands of r rows.
– Create one hash table per band

• For each band, hash its portion of each
column to its hash table
• Candidate pairs are columns that hash to
the same bucket for ≥ 1 band.
• Tune b and r to catch most similar pairs,
but few nonsimilar pairs.

Simplifying Assumption
• There are enough buckets that columns
are unlikely to hash to the same bucket
unless they are identical in a particular
band.
• Hereafter, we assume that “same bucket”
means “identical in that band.”
• Assumption needed only to simplify
analysis, not for correctness of algorithm.

Example of bands
• 100 min-hash signatures/document
• Let’s choose choose b = 20, r = 5
– 20 bands, 5 signatures per band

• Goal: find pairs of documents that are at
least 80% similar.

Suppose C1, C2 are 80% Similar
• Probability C1, C2 identical in one
particular band: (0.8)5 = 0.328.
• Probability C1, C2 are not similar in any of
the 20 bands: (1-0.328)20 = .00035 .
– i.e., about 1/3000th of the 80%-similar column
pairs are false negatives
– We would find 99.965% pairs of truly similar
documents

Suppose C1, C2 Only 30% Similar
• Probability C1, C2 identical in any one
particular band: (0.2)5 = 0.00243
• Probability C1, C2 identical in ≥ 1 of 20
bands: 20 * 0.00243 = 0.0486
• In other words, approximately 4.86% pairs
of docs with similarity 30% end up
becoming candidate pairs
– False positives

LSH Involves a Tradeoff
• Pick the number of minhashes, the
number of bands, and the number of rows
per band to balance false positives/
negatives.
• Example: if we had only 15 bands of 5
rows, the number of false positives would
go down, but the number of false
negatives would go up.

Analysis of LSH – What We Want

Probability
= 1 if s > t
Probability
of sharing
a bucket

No chance
if s < t

t
Similarity s of two sets

What One Band of One Row Gives
You

Remember:
probability of
equal hash-values
= similarity

Probability
of sharing
a bucket

t
Similarity s of two sets

b bands, r rows/band
• Columns C and D have similarity s
• Pick any band (r rows)
– Prob. that all rows in band equal = s r
– Prob. that some row in band unequal = 1 - s r

• Prob. that no band identical = (1 - s r)b
• Prob. that at least 1 band identical =
1 - (1 - s r)b

What b Bands of r Rows Gives You
At least
one band
identical

t ~ (1/b)1/r

Probability
of sharing
a bucket

t
Similarity s of two sets

No bands
identical

1 - (1 - s r )b

Some row All rows
of a band of a band
unequal are equal

Example: b = 20; r = 5
s
.2
.3
.4
.5
.6
.7
.8

1-(1-sr)b
.006
.047
.186
.470
.802
.975
.9996

LSH Summary
• Tune to get almost all pairs with similar
signatures, but eliminate most pairs that
do not have similar signatures.
• Check in main memory that candidate
pairs really do have similar signatures.
• Optional: In another pass through data,
check that the remaining candidate
pairs really represent similar
documents.

The Big Picture

Document

Shingling

Minhashing

The set
of strings
of length k
that appear
in the document

Signatures :
short integer
vectors that
represent the
sets, and
reflect their
similarity

Localitysensitive
Hashing

Candidate
pairs :
those pairs
of signatures
that we need
to test for
similarity.

Theory of LSH
• We have used LSH to find similar
documents
– In reality, columns in large sparse matrices
with high Jaccard similarity
– e.g., customer/item purchase histories

• Can we use LSH for other distance
measures?
– e.g., Euclidean distances, Cosine distance
– Let’s generalize what we’ve learned!

Families of Hash Functions
• For min-hash signatures, we got a minhash function for each permutation of rows
• An example of a family of hash functions
– A (large) set of related hash functions
generated by some mechanism
– We should be able to effciently pick a hash
function at random from such a family

Locality-Sensitive (LS) Families
•
•

Suppose we have a space S of points
with a distance measure d.
A family H of hash functions is said to
be (d1,d2,p1,p2)-sensitive if for any x
and y in S :
1. If d(x,y) < d1, then prob. over all h in H,
that h(x) = h(y) is at least p1.
2. If d(x,y) > d2, then prob. over all h in H,
that h(x) = h(y) is at most p2.

A (d1,d2,p1,p2)-sensitive function

p1

Pr[h(x) = h(y)]
p2

d1

d2

d(x,y)

Example: LS Family
• Let S = sets, d = Jaccard distance, H is
family of minhash functions for all
permutations of rows
• Then for any hash function h in H,
Pr[h(x)=h(y)] = 1-d(x,y)
• Simply restates theorem about minhashing in terms of distances rather than
similarities

Example: LS Family – (2)
• Claim: H is a (1/3, 2/3, 2/3, 1/3)-sensitive
family for S and d.
If distance < 1/3
(so similarity > 2/3)

Then probability
that minhash values
agree is > 2/3

• For Jaccard similarity, minhashing gives
us a (d1,d2,(1-d1),(1-d2))-sensitive family
for any d1 < d2.

Amplifying a LS-Family
• Can we reproduce the “S-curve” effect we
saw before for any LS family?
• The “bands” technique we learned for
signature matrices carries over to this
more general setting.
• Two constructions:
– AND construction like “rows in a band.”
– OR construction like “many bands.”

AND of Hash Functions
• Given family H, construct family H’
consisting of r functions from H.
• For h = [h1,…,hr] in H’, h(x)=h(y) if and
only if hi(x)=hi(y) for all i.
• Theorem: If H is (d1,d2,p1,p2)-sensitive,
then H’ is (d1,d2,(p1)r,(p2)r)-sensitive.
• Proof: Use fact that hi ’s are independent.

OR of Hash Functions
• Given family H, construct family H’
consisting of b functions from H.
• For h = [h1,…,hb] in H’, h(x)=h(y) if and
only if hi(x)=hi(y) for some i.
• Theorem: If H is (d1,d2,p1,p2)-sensitive,
then H’ is (d1,d2,1-(1-p1)b,1-(1-p2)b)sensitive.

Composing Constructions
• r-way AND construction followed by b-way
OR construction
– Exactly what we did with minhashing

• Take points x and y s.t. Pr[h(x) = h(y)] = p
– H will make (x,y) a candidate pair with prob. p

• This construction will make (x,y) a
candidate pair with probability 1-(1-pr)b
– The S-Curve!

AND-OR Composition
• Example: Take H and construct H’ by the
AND construction with r = 4. Then, from
H’, construct H’’ by the OR construction
with b = 4.

Table for Function 1-(1-p4)4
p
.2
.3
.4
.5
.6
.7
.8
.9

1-(1-p4)4
.0064
.0320
.0985
.2275
.4260
.6666
.8785
.9860

Example: Transforms a
(.2,.8,.8,.2)-sensitive
family into a
(.2,.8,.8785,.0064)sensitive family.

OR-AND Composition
• Apply a b-way OR construction followed
by an r-way AND construction
• Tranforms probability p into (1-(1-p)b)r.
– The same S-curve, mirrored horizontally and
vertically.

• Example: Take H and construct H’ by the
OR construction with b = 4. Then, from
H’, construct H’’ by the AND construction
with r = 4.

Table for Function (1-(1-p)4)4
p
.1
.2
.3
.4
.5
.6
.7
.8

(1-(1-p)4)4
.0140
.1215
.3334
.5740
.7725
.9015
.9680
.9936

Example:Transforms a
(.2,.8,.8,.2)-sensitive
family into a
(.2,.8,.9936,.1215)sensitive family.

Cascading Constructions
• Example: Apply the (4,4) OR-AND
construction followed by the (4,4) ANDOR construction.
• Transforms a (.2,.8,.8,.2)-sensitive
family into a (.2,.8,.9999996,.0008715)sensitive family.
• Note this family uses 256 of the original
hash functions.

Summary
• Pick any two distances x < y
• Start with a (x, y, (1-x), (1-y))-sensitive
family
• Apply constructions to produce (x, y, p, q)sensitive family, where p is almost 1 and q
is almost 0.
• The closer to 0 and 1 we get, the more
hash functions must be used.

LSH for Cosine Distance
• Random Hypeplanes
– Technique similar to minhashing

• A (d1,d2,(1-d1/180),(1-d2/180))-sensitive
family for any d1 and d2.

Random Hyperplanes
• Pick a random vector v, which
determines a hash function hv with two
buckets.
• hv(x) = +1 if v.x > 0; = -1 if v.x < 0.
• LS-family H = set of all functions
derived from any vector.
• Claim: For points x and y,
Pr[h(x)=h(y)] = 1 – d(x,y)/180

Proof of Claim
Look in the
plane of x
and y.

v

x

θ
Hyperplane
normal to v
h(x) = h(y)

v

Hyperplane
normal to v
h(x) ≠ h(y)

y
Prob[Red case]
= θ/180

Signatures for Cosine Distance
• Pick some number of random vectors,
and hash your data for each vector.
• The result is a signature (sketch) of +1’s
and –1’s for each data point
• Can be used for LSH like the minhash
signatures for Jaccard distance.
• Amplified using AND and OR
constructions

How to pick random vectors
• Expensive to pick a random vector in M
dimensions for large M
– M random numbers

• A more efficient approach
– It suffices to consider only vectors v
consisting of +1 and –1 components.
– Why is this more efficient?

LSH for Euclidean Distance
• Simple idea: hash functions correspond to
lines.
• Partition the line into buckets of size a.
• Hash each point to the bucket containing
its projection onto the line.
• Nearby points are always close; distant
points are rarely in same bucket.

Projection of Points
Points at
distance d

Bucket
width a

If d << a, then
the chance the
points are in the
same bucket is
at least 1 – d /a.

Randomly
chosen
line

Projection of Points

If d >> a, θ must
be close to 90o
for there to be
any chance points
go to the same
bucket.

Points at
distance d
θ

d cos θ

Bucket
width a

Randomly
chosen
line

An LS-Family for Euclidean Distance
• If points are distance d < a/2, prob. they are in
same bucket ≥ 1- d/a = 1/2
• If points are distance > 2a apart, then they can
be in the same bucket only if d cos θ ≤ a
– cos θ ≤ ½
– 60 < θ < 90
– I.e., at most 1/3 probability.

• Yields a (a/2, 2a, 1/2, 1/3)-sensitive family of
hash functions for any a.
• Amplify using AND-OR cascades