Hamming distance on binary strings in SQL

It appears that storing the data in a BINARY column is an approach bound to perform poorly. The only fast way to get decent performance is to split the content of the BINARY column in multiple BIGINT columns, each containing an 8-byte substring of the original data.

In my case (32 bytes) this would mean using 4 BIGINT columns and using this function:

CREATE FUNCTION HAMMINGDISTANCE(
  A0 BIGINT, A1 BIGINT, A2 BIGINT, A3 BIGINT, 
  B0 BIGINT, B1 BIGINT, B2 BIGINT, B3 BIGINT
)
RETURNS INT DETERMINISTIC
RETURN 
  BIT_COUNT(A0 ^ B0) +
  BIT_COUNT(A1 ^ B1) +
  BIT_COUNT(A2 ^ B2) +
  BIT_COUNT(A3 ^ B3);

Using this approach, in my testing, is over 100 times faster than using the BINARY approach.


FWIW, this is the code I was hinting at while explaining the problem. Better ways to accomplish the same thing are welcome (I especially don’t like the binary > hex > decimal conversions):

CREATE FUNCTION HAMMINGDISTANCE(A BINARY(32), B BINARY(32))
RETURNS INT DETERMINISTIC
RETURN 
  BIT_COUNT(
    CONV(HEX(SUBSTRING(A, 1,  8)), 16, 10) ^ 
    CONV(HEX(SUBSTRING(B, 1,  8)), 16, 10)
  ) +
  BIT_COUNT(
    CONV(HEX(SUBSTRING(A, 9,  8)), 16, 10) ^ 
    CONV(HEX(SUBSTRING(B, 9,  8)), 16, 10)
  ) +
  BIT_COUNT(
    CONV(HEX(SUBSTRING(A, 17, 8)), 16, 10) ^ 
    CONV(HEX(SUBSTRING(B, 17, 8)), 16, 10)
  ) +
  BIT_COUNT(
    CONV(HEX(SUBSTRING(A, 25, 8)), 16, 10) ^ 
    CONV(HEX(SUBSTRING(B, 25, 8)), 16, 10)
  );

Leave a Comment