Why is XOR the default way to combine hashes

I’ve been reading up on Simhashing as a way to find duplicate content in web data. I came across Ryan Moulton’s excellent article on Simhashing. Ryan’s preferred method of Simhashing is to take the n minimum hash values from the set. In his pseudocode, he combines those n hash values by XORing them. This got me wondering:

Why is XOR the default way to combine hashes?

I posted this question on Stack Overflow and got this excellent and concise answer from Greg Hewgill:
Assuming uniformly random (1-bit) inputs, the AND function’s output distribution is 75% 0 and 25% 1. Conversely, OR is 25% 0 and 75% 1.
The XOR function is 50% 0 and 50% 1, so it preserves a uniform probability distribution when combining inputs.
The distributions become clear when we chart the output of each of the operations:
If you combine two random bits with AND you have a 75% chance of a 0. Similarly, if you combine two random bits with OR you have a 75% chance of a 1.
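A minimal way to check those percentages is to enumerate all four equally likely input pairs and tally the outputs of each operation:

```python
import itertools

# Tally [zeros, ones] in the output of each operator over all four
# equally likely (a, b) input pairs.
counts = {"AND": [0, 0], "OR": [0, 0], "XOR": [0, 0]}
for a, b in itertools.product([0, 1], repeat=2):
    counts["AND"][a & b] += 1
    counts["OR"][a | b] += 1
    counts["XOR"][a ^ b] += 1

for op, (zeros, ones) in counts.items():
    print(f"{op}: {zeros}/4 zeros, {ones}/4 ones")
# AND: 3/4 zeros, 1/4 ones
# OR:  1/4 zeros, 3/4 ones
# XOR: 2/4 zeros, 2/4 ones
```

Only XOR keeps the output split 50/50.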
My friend Rob Zinkov [1] explains it by framing it in terms of entropy:
[When combining hashes] essentially you want an operation that maximizes entropy since lower entropy implies the hash is storing less data than it appears.
With AND and OR, much of the variation in the bits is removed, so the hash you end up with tends to be biased towards 1s (in the OR case) or 0s (in the AND case).
That bias means each bit stores slightly less information.
Summary: If you need to combine two hashes, XOR them together and you’ll have a better chance at maintaining the entropy of the original hashes.
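In practice, combining a whole list of hashes this way is just a fold over XOR. A quick sketch, using made-up 64-bit values in place of real feature hashes:

```python
from functools import reduce
from operator import xor

# Hypothetical 64-bit feature hashes; any hash function would do here.
hashes = [0x9E3779B97F4A7C15, 0xC2B2AE3D27D4EB4F, 0x165667B19E3779F9]

# Fold the hashes together with XOR. Order doesn't matter because
# XOR is commutative and associative.
combined = reduce(xor, hashes)
print(hex(combined))
```

Order-independence is convenient here: the combined hash of a set of features is the same however the set is enumerated.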
[1] Rob runs the L.A. Machine Learning meetup group. If you’re in L.A. AND interested in machine learning, stop by sometime.
One property of XOR to keep in mind when doing this is that identical hashes cancel each other out when XORed bit-wise (x ^ x == 0). For example, if DocA and DocB have the same hash, the lists [DocA, DocB, DocC] and [DocC] combine to the same hash, because the DocA and DocB hashes cancel. If these two lists should have different hashes, then a different technique is required.
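The cancellation is easy to demonstrate with made-up hash values:

```python
from functools import reduce
from operator import xor

# Hypothetical hashes; DocA and DocB hash to the same value.
doc_a = doc_b = 0xDEADBEEF
doc_c = 0x12345678

list_1 = [doc_a, doc_b, doc_c]  # three documents...
list_2 = [doc_c]                # ...versus just one

# The identical hashes cancel (x ^ x == 0), so both lists
# fold down to doc_c's hash.
print(reduce(xor, list_1) == reduce(xor, list_2))  # → True
```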
Vijay Kandy (http://vijaykandy.com) writes:
Hi Nate,
Google brought me here, because I didn’t understand the use of XOR in Ryan Moulton’s article. I understand now, thanks for the explanation!
I have a question about calculating the final hash. Did you try (weighted or unweighted) adding/subtracting of the bits of the feature hash values, as suggested in [1], instead of XORing the hashes? If so, can you share your thoughts comparing the two approaches? Ryan Moulton’s article mentions [1] in its Further Reading section.
The reason I ask is that in my simple tests, XORing to combine the feature hashes doesn’t result in smaller Hamming distances for similar documents.