Fast, large-width, non-cryptographic string hashing in python?

Take a look at the 128-bit variant of MurmurHash3. The algorithm's page includes some performance numbers. Should be possible to port this to Python, pure or as a C extension. (Updated: the author recommends using the 128-bit variant and throwing away the bits you don't need.)

If MurmurHash2 64-bit works for you, there is a Python implementation (C extension) in the pyfasthash package, which includes a few other non-cryptographic hash variants, though some of these only offer 32-bit output.

Update: I did a quick Python wrapper for the Murmur3 hash function. Github project is here and you can find it on Python Package Index as well; it just needs a C++ compiler to build; no Boost required.

Usage example and timing comparison:

    import murmur3
    import timeit

    # without seed
    print murmur3.murmur3_x86_64('samplebias')
    # with seed value
    print murmur3.murmur3_x86_64('samplebias', 123)

    # timing comparison with str __hash__
    t = timeit.Timer("murmur3.murmur3_x86_64('hello')", "import murmur3")
    print 'murmur3:', t.timeit()

    t = timeit.Timer("str.__hash__('hello')")
    print 'str.__hash__:', t.timeit()

Output:

    15662901497824584782
    7997834649920664675
    murmur3: 0.264422178268
    str.__hash__: 0.219163894653
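The advice to take the 128-bit variant and throw away the bits you don't need comes down to masking. Here is a minimal Python 3 sketch, assuming the 128-bit hash is already available as a plain integer. The helper name `truncate_hash` and the sample value are hypothetical and are not part of any MurmurHash3 wrapper.

```python
def truncate_hash(value: int, bits: int = 64) -> int:
    """Keep only the low `bits` bits of a wider hash value."""
    if bits <= 0:
        raise ValueError("bits must be positive")
    return value & ((1 << bits) - 1)


# Made-up 128-bit value standing in for a real 128-bit hash result.
sample_128 = 0x0123456789ABCDEFFEDCBA9876543210

print(hex(truncate_hash(sample_128)))      # low 64 bits
print(hex(truncate_hash(sample_128, 34)))  # low 34 bits, the minimum asked for
```

Masking keeps the low bits. Taking the high bits instead (a right shift) should work equally well if the hash mixes its output bits evenly.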

I'll take another look at it. I'll let you know how it goes - thanks! – eblume Mar 23 at 5:00

Yep, it requires Boost Python. On Ubuntu this can be installed with sudo apt-get install libboost-python-dev. I built a package in my PPA as an example. – samplebias Mar 23 at 6:10

Unfortunately Ubuntu's package management system is still back with python 2.6 so I had to install 2.7 on the side. I could be incredibly dense but it looks like Boost Python has a wickedly difficult manual install. Any tips? – eblume Mar 23 at 7:28

Yep, I also did a performance test on the Boost-wrapper murmur2 and found it lacking, so I created my own wrapper around murmur3. Check the update above. This should get you going. :-) – samplebias Mar 23 at 14:51

This is quite fantastic, thanks very much! I have a question though - platform.cpp has some mentions of processor affinity in it. The code that will be executing the hashing function is already highly parallelized - I hope that won't cause problems with this package? – eblume Mar 23 at 21:52

"Strings": I'm presuming you wish to hash Python 2.x str objects and/or Python 3.x bytes and/or bytearray objects.

This may violate your first constraint, but: consider using something like (zlib.adler32(strg, perturber).
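To illustrate what a seeded `zlib.adler32` call returns, here is a minimal Python 3 sketch. The `wide_checksum` helper, and packing two 32-bit checksums into one 64-bit integer, are assumptions made for this example and are not taken from the answer. Adler-32 and CRC-32 are checksums, not hash functions designed for even bit distribution, so the output bits may be poorly mixed, especially for short inputs.

```python
import zlib


def wide_checksum(data: bytes, perturber: int = 1) -> int:
    """Hypothetical helper: pack a seeded Adler-32 and a CRC-32 into 64 bits."""
    # The second argument to adler32 is the starting value of the checksum.
    high = zlib.adler32(data, perturber) & 0xFFFFFFFF  # mask keeps it unsigned
    low = zlib.crc32(data) & 0xFFFFFFFF
    return (high << 32) | low


print(wide_checksum(b"samplebias"))
print(wide_checksum(b"samplebias", 123))
```

Unlike the built-in `hash()` on recent Python 3 releases, these checksums do not vary between interpreter runs.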

You are correct in your assumption that I'm hashing str objects - I'll look into this snippet, thanks, but you're right, I personally doubt that there is consistent entropy to each output bit here. Thanks though! – eblume Mar 23 at 4:58

If you can use Python 3.2, the hash result on 64-bit Windows is now a 64-bit value.

I've been using Python 2.7, but if the hash width in the 3.x engine is definitely, consistently wider then that might be enough to get me to switch. Thanks! – eblume Mar 23 at 4:56

@eblume: The 64-bit hash on 64-bit Windows is an enhancement in 3.2. 64-bit Linux platforms have always had a 64-bit hash value. 32-bit versions of Python (both Linux and Windows) only have a 32-bit hash value. – casevh Mar 23 at 13:45
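The width of the built-in hash on a given interpreter can be checked directly. This is a minimal sketch for Python 3.2 or later, where `sys.hash_info` is available; the helper name `builtin_hash_width` is made up for the example.

```python
import sys


def builtin_hash_width() -> int:
    """Return the width, in bits, of values produced by the built-in hash()."""
    return sys.hash_info.width


print("hash() width:", builtin_hash_width(), "bits")
# sys.maxsize exceeds 2**32 only on a 64-bit build of the interpreter.
print("64-bit interpreter:", sys.maxsize > 2 ** 32)
```

Note that since Python 3.3 the hash of str and bytes objects is randomized per process by default, so a wide `hash()` value is only suitable when the result does not need to be stable across runs.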

I have a need for a high-performance string hashing function in python that produces integers with at least 34 bits of output (64 bits would make sense, but 32 is too few). The approaches usually suggested each fall short:

Use the built-in hash() function. This function, at least on the machine I'm developing for (with python 2.7, and a 64-bit cpu) produces an integer that fits within 32 bits - not large enough for my purposes.

hashlib provides cryptographic hash routines, which are far slower than they need to be for non-cryptographic purposes. I find this self-evident, but if you require benchmarks and citations to convince you of this fact then I can provide that.

Use the string.__hash__() function as a prototype to write your own function. I suspect this will be the correct way to go, except that this particular function's efficiency lies in its use of the c_mul function, which wraps around 32 bits - again, too small for my use!
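The third option can be sketched in pure Python by making the multiplication wrap at 64 bits instead of 32. The code below is a minimal Python 3 sketch. The multiply-and-XOR loop follows the commonly cited pure-Python rendition of the Python 2 string hash, which is an assumption about that prototype rather than something stated in the question, and the name `wide_string_hash` is hypothetical. A pure-Python loop like this will be far slower than a C extension.

```python
MASK64 = (1 << 64) - 1  # wrap arithmetic at 64 bits instead of 32


def wide_string_hash(data: bytes) -> int:
    """Multiply-and-XOR hash over bytes, truncated to 64 bits."""
    if not data:
        return 0
    value = (data[0] << 7) & MASK64
    for byte in data:
        # The mask plays the role of c_mul: keep only the low 64 bits.
        value = ((1000003 * value) & MASK64) ^ byte
    value ^= len(data)
    return value


print(wide_string_hash(b"samplebias"))
print(wide_string_hash(b"hello"))
```

Changing `MASK64` to a 34-bit mask would give the minimum width asked for, at the cost of more collisions.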
