c# - Better 64-bit byte array hash -
i need hash algorithm produces 64-bit hash code (long
) fewer collisions string.gethashcode()
, fast (no expensive calls cryptographic functions). here's implementation of fnv still shows 3% of collisions after testing 2 million random strings. need number way lower.
void main() { const string chars = "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz!#@$%^&*()_+}{\":?><,./;'[]0123456789\\"; const int n = 2000000; var random = new random(); var hashes = new hashset<long>(); int collisions = 0; for(int = 0; < n; i++) { var len = random.next(chars.length); var str = new char[len]; (int j = 0; j < len; j++) { str[j] = chars[random.next(chars.length)]; } var s = new string(str); if(!hashes.add(get64bithash( s ))) collisions++; } console.writeline("collision percentage after " + n + " random strings: " + ((double)collisions * 100 / n)); } public long get64bithash(string str) { unchecked { byte[] data = new byte[str.length * sizeof(char)]; system.buffer.blockcopy(str.tochararray(), 0, data, 0, data.length); const ulong p = 1099511628211ul; var hash = 14695981039346656037ul; foreach(var d in data) { hash ^= d; hash *= p; } return (long) hash; } }
sample output of above code:
collision percentage after 2000000 random strings: 3.01485 %
3% same collision percentage calling string.gethashcode()
. need way better.
ps: there's chance doing terribly long.
edit: solved. get64bithash
method above correct. problem strings weren't random. after making sure strings unique (see revised code below), zero collisions on 50 million unique strings, versus ~1% collisions using string.gethashcode()
.
void main() { const string chars = "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz!#@$%^&*()_+}{\":?><,./;'[]0123456789\\"; const int n = 200000000; var random = new random(); var hashes = new hashset<long>(); var strings = new hashset<string>(); int collisions = 0; while(strings.count < n) { var len = random.next(chars.length); var str = new char[len]; (int j = 0; j < len; j++) { str[j] = chars[random.next(chars.length)]; } var s = new string(str); if(!strings.add(s)) continue; if(!hashes.add(s.gethashcode())) collisions++; } console.writeline("collision percentage after " + n + " random strings: " + ((double)collisions * 100 / strings.count)); }
the problem strings aren't random. test string before hashing second time.
Comments
Post a Comment