I’ve been reading up on Simhashing as a way to find duplicate content in web data. I came across Ryan Moulton’s excellent article on Simhashing. Ryan’s preferred method of Simhashing is to take the n-minimum hash values from the set. In his pseudocode to combine the n hash values he XORs them. This got me wondering:
Why is XOR the default way to combine hashes?
I posted this question on Stackoverflow and got this excellent and concise answer from Greg Hewgill:
Assuming random (1-bit) inputs, the AND function output probability distribution is 75% 0 and 25% 1. Conversely, OR is 25% 0 and 75% 1.
The XOR function is 50% 0 and 50% 1, therefore it is good for combining uniform probability distributions.
The distributions become clear when we chart the output of each of the operations:
If you combine two random bits with AND, you have a 75% chance of getting 0 (the output is 1 only when both inputs are 1). Similarly, if you combine two random bits with OR, you have a 75% chance of getting 1.
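A quick sketch in Python makes the distributions concrete: enumerate all four equally likely input pairs and tally how often each operation outputs a 1.

```python
from itertools import product

# Each bitwise operation applied to every equally likely (a, b) pair.
ops = {
    "AND": lambda a, b: a & b,
    "OR":  lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
}

for name, op in ops.items():
    outputs = [op(a, b) for a, b in product([0, 1], repeat=2)]
    print(f"{name}: {sum(outputs)}/4 of outputs are 1")
# AND: 1/4 of outputs are 1
# OR: 3/4 of outputs are 1
# XOR: 2/4 of outputs are 1
```

Only XOR keeps the output balanced at 50/50, which is why it preserves a uniform distribution.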
My friend Rob Zinkov explains it by framing it in terms of entropy:
[When combining hashes] essentially you want an operation that maximizes entropy since lower entropy implies the hash is storing less data than it appears.
With AND and OR, much of the variation in the bits is removed, so the hash you end up with tends to be biased towards 1s (as in the OR case) or 0s (as in the AND case).
That bias means each bit is storing slightly less information.
Summary: If you need to combine two hashes, XOR them together and you’ll have a better chance at maintaining the entropy of the original hashes.
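To see the bias play out on full hash values rather than single bits, here is a small simulation (my own sketch, not from the article) that combines pairs of random 64-bit hashes with each operation and counts how many bits end up set. An unbiased combination should average about 32 of 64 bits.

```python
import random

random.seed(42)
hashes = [random.getrandbits(64) for _ in range(1000)]

def avg_set_bits(op):
    """Average popcount after combining consecutive pairs of hashes with op."""
    pairs = list(zip(hashes[::2], hashes[1::2]))
    return sum(bin(op(a, b)).count("1") for a, b in pairs) / len(pairs)

print("XOR:", avg_set_bits(lambda a, b: a ^ b))  # ~32 bits set: unbiased
print("AND:", avg_set_bits(lambda a, b: a & b))  # ~16 bits set: biased to 0s
print("OR: ", avg_set_bits(lambda a, b: a | b))  # ~48 bits set: biased to 1s
```

The XOR-combined hashes stay near the balanced 32-bit mark, while AND and OR drift toward all-zeros and all-ones respectively, squeezing out entropy.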
Rob runs the L.A. Machine Learning meetup group. If you’re in L.A. AND interested in machine learning, stop by sometime.