c# - Better 64-bit byte array hash -

i need hash algorithm produces 64-bit hash code (long) fewer collisions string.gethashcode() , fast (no expensive calls cryptographic functions). here's implementation of fnv still shows 3% of collisions after testing 2 million random strings. need number way lower.

void main() {     const string chars = "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz!#@$%^&*()_+}{\":?><,./;'[]0123456789\\";     const int n = 2000000;     var random = new random();     var hashes = new hashset<long>();     int collisions = 0;     for(int = 0; < n; i++)     {         var len = random.next(chars.length);         var str = new char[len];         (int j = 0; j < len; j++)         {             str[j] = chars[random.next(chars.length)];         }         var s = new string(str);         if(!hashes.add(get64bithash( s ))) collisions++;     }     console.writeline("collision percentage after " + n + " random strings: " + ((double)collisions * 100 / n)); }   public long get64bithash(string str) {   unchecked   {      byte[] data = new byte[str.length * sizeof(char)];      system.buffer.blockcopy(str.tochararray(), 0, data, 0, data.length);       const ulong p = 1099511628211ul;      var hash = 14695981039346656037ul;      foreach(var d in data)      {         hash ^= d;         hash *= p;      }      return (long) hash;   } }

sample output of above code:

collision percentage after 2000000 random strings: 3.01485 %

3% same collision percentage calling string.gethashcode(). need way better.

ps: there's chance doing terribly long.

edit: solved. get64bithash method above correct. problem strings weren't random. after making sure strings unique (see revised code below), zero collisions on 50 million unique strings, versus ~1% collisions using string.gethashcode().

void main() {     const string chars = "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz!#@$%^&*()_+}{\":?><,./;'[]0123456789\\";     const int n = 200000000;     var random = new random();     var hashes = new hashset<long>();     var strings = new hashset<string>();     int collisions = 0;     while(strings.count < n)     {         var len = random.next(chars.length);         var str = new char[len];         (int j = 0; j < len; j++)         {             str[j] = chars[random.next(chars.length)];         }         var s = new string(str);         if(!strings.add(s)) continue;         if(!hashes.add(s.gethashcode())) collisions++;     }     console.writeline("collision percentage after " + n + " random strings: " + ((double)collisions * 100 / strings.count)); }

the problem strings aren't random. test string before hashing second time.

Search This Blog

Current CAD

c# - Better 64-bit byte array hash -

Comments

Post a Comment