Why is XOR the default way to combine hashes?

I’ve been reading up on Simhashing as a way to find duplicate content in web data. I came across Ryan Moulton’s excellent article on Simhashing. Ryan’s preferred method is to take the n minimum hash values from the set, and in his pseudocode he combines those n hash values by XORing them together. This got me wondering:

Why is XOR the default way to combine hashes?

I posted this question on Stackoverflow and got this excellent and concise answer from Greg Hewgill:

Assuming random (1-bit) inputs, the AND function output probability distribution is 75% 0 and 25% 1. Conversely, OR is 25% 0 and 75% 1.

The XOR function is 50% 0 and 50% 1, therefore it is good for combining uniform probability distributions.

The distributions become clear when we chart the output of each of the operations:

[Figure: and-or-xor — the output of AND, OR, and XOR for each combination of two input bits]

If you combine two random bits with AND you have a 75% chance of getting a 0. Similarly, if you combine two random bits with OR you have a 75% chance of getting a 1.
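
A quick way to see this is to simulate it. Here’s a small Python sketch (mine, not from Greg’s answer) that combines pairs of random bits with each operator and prints the empirical probability of getting a 1:

    import random

    # Empirically check the output distribution of AND, OR, and XOR
    # when combining pairs of uniformly random bits.
    trials = 100_000
    ones = {"AND": 0, "OR": 0, "XOR": 0}

    for _ in range(trials):
        a, b = random.getrandbits(1), random.getrandbits(1)
        ones["AND"] += a & b
        ones["OR"] += a | b
        ones["XOR"] += a ^ b

    for op, count in ones.items():
        print(f"{op:>3}: P(1) ~ {count / trials:.3f}")
    # AND comes out around 0.25, OR around 0.75, and XOR around 0.50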

My friend Rob Zinkov [1] explains it by framing it in terms of entropy:

[When combining hashes] essentially you want an operation that maximizes entropy since lower entropy implies the hash is storing less data than it appears.

With AND and OR, a lot of the variation in the bits is removed, so the hash you end up with tends to be biased towards 1s (as in the OR case) or 0s (as in the AND case).

That bias means each bit ends up storing slightly less information.
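
To put a number on that (my own back-of-the-envelope addition, not Rob’s): the Shannon entropy of a single output bit with P(1) = p is H(p) = -p log2 p - (1-p) log2 (1-p), so the biased AND and OR outputs carry about 0.81 bits each, while XOR keeps the full 1 bit:

    from math import log2

    # Shannon entropy of one output bit, given its probability of being 1.
    def bit_entropy(p1):
        p0 = 1.0 - p1
        return -sum(p * log2(p) for p in (p0, p1) if p > 0)

    print("AND:", bit_entropy(0.25))  # ~0.811 bits
    print("OR: ", bit_entropy(0.75))  # ~0.811 bits
    print("XOR:", bit_entropy(0.50))  # exactly 1.0 bit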

Summary: If you need to combine two hashes, XOR them together and you’ll have a better chance at maintaining the entropy of the original hashes.
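
To tie it back to where this started, here’s a minimal sketch of the combine step as I understand it from Ryan’s article (the SHA-1 feature hashing and the function name are my own, not his): hash every feature, keep the n smallest hash values, and XOR them into a single signature.

    import hashlib
    from functools import reduce

    # Sketch: hash each feature, keep the n minimum hash values,
    # and fold them together with XOR to get one signature.
    def combine_min_hashes(features, n=8):
        hashes = sorted(
            int.from_bytes(hashlib.sha1(f.encode()).digest(), "big")
            for f in features
        )
        return reduce(lambda acc, h: acc ^ h, hashes[:n], 0)

    doc = "the quick brown fox jumps over the lazy dog".split()
    print(hex(combine_min_hashes(doc)))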


[1] Rob runs the L.A. Machine Learning meetup group. If you’re in L.A. AND interested in machine learning, stop by sometime.

  • http://quantumlab.net Matt Pulver

    One property of xor to keep in mind when doing this is that identical hashes cancel each other out when xoring them bit-wise. For example, the following 2 lists combine to the same hash:

    23fd68296bc65391799c8c441faf4403c729256f	DocE
    402183e1cbc52e7c87eb230c281f35e4b27c2a39	DocD
    49c31c1459a7603bd5680d11285a5716c4ba3903	DocA
    49c31c1459a7603bd5680d11285a5716c4ba3903	DocB	
    58e5a2035461323a37102e22273c9b25cbb9df61	DocC
    

    and

    23fd68296bc65391799c8c441faf4403c729256f	DocE
    402183e1cbc52e7c87eb230c281f35e4b27c2a39	DocD
    58e5a2035461323a37102e22273c9b25cbb9df61	DocC
    

    This is because DocA and DocB have the same hash. If these two lists should have different hashes, then a different technique is required.
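
    A few lines of Python make the cancellation concrete (hash values truncated here for readability):

        from functools import reduce

        # XOR-ing a list that contains a duplicated hash gives the same result
        # as XOR-ing the list with the duplicate pair removed, since x ^ x == 0.
        doc_a = 0x49C31C14
        doc_b = 0x49C31C14  # same hash as DocA
        doc_c = 0x58E5A203
        doc_d = 0x402183E1
        doc_e = 0x23FD6829

        with_dupes = reduce(lambda acc, h: acc ^ h, [doc_e, doc_d, doc_a, doc_b, doc_c], 0)
        without    = reduce(lambda acc, h: acc ^ h, [doc_e, doc_d, doc_c], 0)
        assert with_dupes == without  # DocA and DocB cancel each other out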

  • http://vijaykandy.com Vijay Kandy

    Hi Nate,

    Google brought me here, because I didn’t understand the use of XOR in Ryan Moulton’s article. I understand now, thanks for the explanation!

    I have a question about calculating the final hash. Did you try the (weighted or unweighted) adding/subtracting of the bits of the feature hash values, as suggested in [1], instead of XORing the hashes? If so, can you share your thoughts comparing the two approaches? Ryan Moulton’s article mentions [1] in the Further Reading section.

    The reason I ask is that in my simple tests, XORing feature hashes together doesn’t result in smaller Hamming distances for similar documents.

    [1] http://infolab.stanford.edu/~manku/papers/07www-duplicates.pdf

    Thanks and Regards,
    Vijay Kandy
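
    For anyone comparing the two approaches, here is a rough sketch of the per-bit add/subtract combining described in [1] (my own reading of the paper, using SHA-1 feature hashes and made-up names): each feature’s hash votes plus or minus its weight on every bit position, and the sign of each counter becomes the corresponding fingerprint bit. Because a few changed features only nudge the counters, similar documents tend to get fingerprints with small Hamming distance.

        import hashlib

        # Sketch of bit-summing fingerprints: every feature hash votes on each
        # bit position; the fingerprint keeps the sign of the per-bit totals.
        def simhash(features, bits=64):
            counts = [0] * bits
            mask = (1 << bits) - 1
            for feature, weight in features:
                h = int.from_bytes(hashlib.sha1(feature.encode()).digest(), "big") & mask
                for i in range(bits):
                    counts[i] += weight if (h >> i) & 1 else -weight
            fingerprint = 0
            for i, c in enumerate(counts):
                if c > 0:
                    fingerprint |= 1 << i
            return fingerprint

        a = simhash([("the", 1), ("quick", 1), ("brown", 1), ("fox", 1)])
        b = simhash([("the", 1), ("quick", 1), ("brown", 1), ("foxes", 1)])
        print(bin(a ^ b).count("1"))  # Hamming distance between the fingerprints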

  • http://twitter.com/TheZachW Zachary Wasserman

    I believe you got your 1s and 0s mixed up in the actual text there. The diagram represents what I would expect.

  • Nate

    Zachary! You’re totally right, I’ll have to update that.