# Why is XOR the default way to combine hashes

I’ve been reading up on Simhashing as a way to find duplicate content in web data. I came across Ryan Moulton’s excellent article on Simhashing. Ryan’s preferred method of Simhashing is to take the n-minimum hash values from the set. In his pseudocode to combine the n hash values he XORs them. This got me wondering:

Why is XOR the default way to combine hashes?

I posted this question on Stackoverflow and got this excellent and concise answer from Greg Hewgill:

Assuming random (1-bit) inputs, the AND function output probability distribution is 25% 0 and 75% 1. Conversely, OR is 75% 0 and 25% 1.

The XOR function is 50% 0 and 50% 1, therefore it is good for combining uniform probability distributions.

The distributions become clear when we chart the output of each of the operations:

If you are combining two random bits with AND you have a 75% chance of 1. Similarly, if you combine two random bits with OR you have a 75% chance of 0.

My friend Rob Zinkov [1] explains it by framing it in terms of entropy:

[When combining hashes] essentially you want an operation that maximizes entropy since lower entropy implies the hash is storing less data than it appears.

With AND and OR lots of the variation in bits is being removed, so the hash you end up with tend to be biased to towards 1s (as in the OR case) or 0s (as in the AND case)

That bias means each bit is storing slightly less information (since that bias is present).

Summary: If you need to combine two hashes, XOR them together and you’ll have a better chance at maintaining the entropy of the original hashes.

[1] Rob runs the L.A. Machine Learning meetup group. If you’re in L.A. AND interested in machine learning, stop by sometime.

Share:
• http://quantumlab.net Matt Pulver

One property of xor to keep in mind when doing this is that identical hashes cancel each other out when xoring them bit-wise. For example, the following 2 lists combine to the same hash:

```23fd68296bc65391799c8c441faf4403c729256f	DocE
402183e1cbc52e7c87eb230c281f35e4b27c2a39	DocD
49c31c1459a7603bd5680d11285a5716c4ba3903	DocA
49c31c1459a7603bd5680d11285a5716c4ba3903	DocB
58e5a2035461323a37102e22273c9b25cbb9df61	DocC
```

and

```23fd68296bc65391799c8c441faf4403c729256f	DocE
402183e1cbc52e7c87eb230c281f35e4b27c2a39	DocD
58e5a2035461323a37102e22273c9b25cbb9df61	DocC
```

This is because DocA and DocB have the same hash. If these two lists should have different hashes, then a different technique is required.

• http://vijaykandy.com Vijay Kandy

Hi Nate,

Google brought me here, because I didn’t understand the use of XOR in Ryan Moulton’s article. I understand now, thanks for the explanation!

I have a question regarding calculating the final hash. Did you try (weighted or not) adding/subtracting of bits of hash values of features as suggested in [1] instead of XORing hashes? If so, can you share your thoughts comparing both approaches? Ryan Moulton’s article mentions [1] in Further Reading section.

The reason I ask is that in my simple tests, I find XORing to combine hashes of features doesn’t result in smaller Hamming distances for similar documents.

Thanks and Regards,
Vijay Kandy