Using strings as python dictionaries (memory management)?

Yes, a dict stores its keys in memory. If your data fits in memory, this is the easiest approach. Hashing should work.

Try MD5. It is a 16-byte (128-bit) value, so collisions are unlikely. Try BerkeleyDB for a disk-based approach.
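A minimal sketch of the digest-as-key idea, using the standard-library hashlib; the names index, add_sentence and the location argument are made up for illustration:

    import hashlib

    index = {}   # 16-byte MD5 digest -> wherever the sentence lives

    def add_sentence(sentence, location):
        key = hashlib.md5(sentence.encode('utf-8')).digest()   # 16 bytes, not the whole string
        if key in index:
            return index[key]   # almost certainly a duplicate; a real collision is extremely unlikely
        index[key] = location
        return None

The dict then only holds the small fixed-size digests plus whatever value you attach, rather than the full sentences.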

– GaretJax Jul 19 at 22:19

BerkeleyDB uses btree indexes, so you get sorted sequences of keys very efficiently. It also handles guaranteed persistence. Redis is not a database; it is a network protocol service. Sending the data to another server is not the same as writing it to a disk.

Of course Redis can be configured to write to a disk as well, but if that is your goal, why bother with Redis in the middle? – Michael Dillon Jul 20 at 5:06

How does BerkeleyDB compare to SQLite? – ScienceFriction Jul 20 at 5:50

I'm not making any in-depth DB comparison here.

I suggest BerkeleyDB because it works like a disk-based persistent dictionary and it comes for free with Python, at least up to 2.x. It does the job with minimal effort, that's it.

– Wai Yip Tung Jul 20 at 19:57

sqlite comes free with Python as well. It runs as part of the Python process, so no communication with a (even local) server is needed. Unlike BDB, it allows relational operations.
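If you go the sqlite route, here is a minimal sketch with the standard-library sqlite3 module (the file name, table name, and remember helper are made up); a primary key plus INSERT OR IGNORE gives a cheap "have I seen this hash before?" check:

    import sqlite3

    conn = sqlite3.connect('sentences.db')
    conn.execute('CREATE TABLE IF NOT EXISTS seen (h TEXT PRIMARY KEY, ptr INTEGER)')

    def remember(digest_hex, ptr):
        # True if the hash was new, False if it was already stored
        cur = conn.execute('INSERT OR IGNORE INTO seen VALUES (?, ?)', (digest_hex, ptr))
        return cur.rowcount == 1

    print(remember('d41d8cd98f00b204e9800998ecf8427e', 42))   # True the first time
    print(remember('d41d8cd98f00b204e9800998ecf8427e', 99))   # False: already seen
    conn.commit()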

I'll be experimenting with one of them, so I'm sniffing around. – ScienceFriction Jul 21 at 13:00
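Since BerkeleyDB came up above as a disk-based persistent dictionary, here is a rough sketch using Python 2's bundled bsddb module (removed in Python 3, where the third-party bsddb3 package exposes a similar API); the file name and sample key/value are made up:

    import bsddb

    db = bsddb.btopen('hashes.db', 'c')              # btree-backed file, keys kept sorted on disk
    db['d41d8cd98f00b204e9800998ecf8427e'] = '42'    # keys and values must be byte strings
    db.sync()                                        # flush to disk for persistence
    print(db.keys()[:5])                             # already in sorted order thanks to the btree
    db.close()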

Python dicts are indeed memory monsters. You can hardly operate on millions of keys when storing anything larger than integers. Consider the following code:

    import random

    BITS = 64   # also measured with 128, 256 and 512
    d = {}
    for x in xrange(5000000):   # it's 5 millions
        d[x] = random.getrandbits(BITS)

For BITS = 64 it takes 510 MB of my RAM, for BITS = 128 550 MB, for BITS = 256 650 MB, for BITS = 512 830 MB. Increasing the number of iterations to 10 millions roughly doubles the memory usage. However, consider this snippet:

    d = {}
    for x in xrange(5000000):   # it's 5 millions
        d[x] = (random.getrandbits(64), random.getrandbits(64))

It takes 1.1 GB of my memory. Conclusion?

If you want to keep two 64-bit integers, pack them into one 128-bit integer, like this:

    d = {}
    for x in xrange(5000000):   # it's still 5 millions
        d[x] = random.getrandbits(64) | (random.getrandbits(64) << 64)

It depends on your actual memory limit and the number of sentences, but you should be safe using dictionaries with 10-20 millions of keys when using just integers.
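If it helps, packing and unpacking are just shifts and masks; pack and unpack are made-up helper names for the trick used above:

    MASK64 = (1 << 64) - 1

    def pack(a, b):
        # two 64-bit values stored in a single Python int
        return (a & MASK64) | ((b & MASK64) << 64)

    def unpack(packed):
        return packed & MASK64, packed >> 64

    assert unpack(pack(123, 456)) == (123, 456)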

You have a good idea with hashes, but you probably want to keep a pointer to the sentence, so in case of a collision you can investigate (compare the sentences char by char and probably print them out). You could encode the pointer as an integer, for example by packing the file number and the offset into it (a sketch follows below). If you don't expect a massive number of collisions, you can simply set up another dictionary for storing only the collisions, for example:

    hashes = {}
    collisions = {}
    for s in sentences:
        ptr_value = pointer(s)   # make it an integer, e.g. file number + offset
        hash_value = hash(s)     # make it an integer
        if hash_value in hashes:
            collisions.setdefault(hashes[hash_value], []).append(ptr_value)
        else:
            hashes[hash_value] = ptr_value

At the end you will have a collisions dictionary where each key is a pointer to a sentence and the value is an array of pointers that key collides with. It sounds pretty hacky, but working with integers is just fine (and fun!).
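One possible way to build such an integer pointer, assuming offsets fit in 40 bits (about 1 TB per file); make_pointer and resolve_pointer are made-up names:

    OFFSET_BITS = 40

    def make_pointer(file_no, offset):
        return (file_no << OFFSET_BITS) | offset

    def resolve_pointer(ptr):
        return ptr >> OFFSET_BITS, ptr & ((1 << OFFSET_BITS) - 1)

    assert resolve_pointer(make_pointer(3, 1024)) == (3, 1024)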

Perhaps pass the keys through md5: docs.python.org/library/md5.html.

I'm not sure exactly how large the data set you are comparing is, but I would recommend looking into Bloom filters (be careful of false positives): en.wikipedia.org/wiki/Bloom_filter. Another avenue to consider would be something simple like cosine similarity or edit distance between documents, but if you are trying to compare one document with many, I would again suggest Bloom filters; you can encode the data however you find most efficient for your problem.
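For a feel of how a Bloom filter works, here is a toy sketch; the bit-array size and number of hash positions are arbitrary and would need tuning to your item count and acceptable false-positive rate:

    import hashlib

    class BloomFilter(object):
        def __init__(self, size_bits=8 * 1024 * 1024, num_hashes=5):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, item):
            # derive several bit positions from one 128-bit MD5 value
            value = int(hashlib.md5(item.encode('utf-8')).hexdigest(), 16)
            for i in range(self.num_hashes):
                yield (value >> (i * 16)) % self.size

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    bf = BloomFilter()
    bf.add("some sentence")
    print("some sentence" in bf)       # True
    print("another sentence" in bf)    # False (with high probability)

A "no" answer is always correct; a "yes" answer may be a false positive, so you would still verify hits against the real data.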

I usually use Python, but I'm beginning to think that Python is not the best choice for this task. Am I wrong about it? Is there a reasonable way to do it with Python?

If I understand correctly, Python dictionaries use open addressing, which means that the key itself is also saved in the array. If this is indeed the case, it means that a Python dictionary allows efficient lookup but is VERY bad in memory usage; if I have millions of sentences, they are all saved in the dictionary, which is horrible since it exceeds the available memory, making the Python dictionary an impractical solution. Can someone confirm the previous paragraph?
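A crude way to see the per-key cost difference (sys.getsizeof reports the size of the key object itself, which the dict keeps alive via a reference, not the size of the dict's internal table):

    import sys

    sentence = "a fairly long sentence that would otherwise be held alive as the key " * 3
    print(sys.getsizeof(sentence))         # the whole string object is retained
    print(sys.getsizeof(hash(sentence)))   # an int key costs only a few dozen bytes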

One idea is to store a hash of each sentence (e.g. its MD5) as the key; this way the key stored in the array is an int (that will be hashed again) instead of the full sentence.
