Three suggestions: Use one dictionary It's easier, it's more straightforward, and someone else has already optimized this problem for you. Until you've actually measured your code and traced a performance problem to this part of it, you have no reason not to do the simple, straightforward thing Optimize later If you are really worried about performance, then abstract the problem make a class to wrap whatever lookup mechanism you end up using and write your code to use this class. You can change the implementation later if you find you need some other data structure for greater performance Read up on hash tables Dictionaries are hash tables and if you are worried about their time or space overhead, you should read up on how they're implemented.
This is basic computer science. The short of it is that hash tables are: average case O(1) lookup time O(n) space (Expect about 2n depending on various parameters) I do not know where you read that they were O(n^2) space, but if they were, then they would not be in widespread, practical use as they are in most languages today. There are two advantages to these nice properties of hash tables: O(1) lookup time implies that you will not pay a cost in lookup time for having a larger dictionary, as lookup time doesn't depend on size O(n) space implies that you don't gain much of anything from breaking your dictionary up into smaller pieces.
Space scales linearly with number of elements, so lots of small dictionaries will not take up significantly less space than one large one or vice versa. This would not be true if they were O(n^2) space, but lucky for you, they're not Here are some more resources that might help: The Wikipedia article on Hash Tables gives a great listing of the various lookup and allocation schemes used in hashtables The GNU Scheme documentation has a nice discussion of how much space you can expect hashtables to take up, including a formal discussion of why the amount of space used by the hash table is proportional to the number of associations in the table This might interest you Here are some things you might consider if you find you actually need to optimize your dictionary implementation: Here is the C source code for Python's dictionaries, in case you want ALL the details. There's copious documentation in here: dictobject.
H dicobject. C Here is a python implementation of that, in case you don't like reading C (Thanks to Ben Peterson ) The Java Hashtable class docs talk a bit about how load factors work, and how they affect the space your hash takes up. Note there's a tradeoff between your load factor and how frequently you need to rehash Rehashes can be costly.
Three suggestions: Use one dictionary. It's easier, it's more straightforward, and someone else has already optimized this problem for you. Until you've actually measured your code and traced a performance problem to this part of it, you have no reason not to do the simple, straightforward thing.
Optimize later. If you are really worried about performance, then abstract the problem make a class to wrap whatever lookup mechanism you end up using and write your code to use this class. You can change the implementation later if you find you need some other data structure for greater performance.
Read up on hash tables. Dictionaries are hash tables, and if you are worried about their time or space overhead, you should read up on how they're implemented. This is basic computer science.
The short of it is that hash tables are: average case O(1) lookup time O(n) space (Expect about 2n, depending on various parameters) I do not know where you read that they were O(n^2) space, but if they were, then they would not be in widespread, practical use as they are in most languages today. There are two advantages to these nice properties of hash tables: O(1) lookup time implies that you will not pay a cost in lookup time for having a larger dictionary, as lookup time doesn't depend on size. O(n) space implies that you don't gain much of anything from breaking your dictionary up into smaller pieces.
Space scales linearly with number of elements, so lots of small dictionaries will not take up significantly less space than one large one or vice versa. This would not be true if they were O(n^2) space, but lucky for you, they're not. Here are some more resources that might help: The Wikipedia article on Hash Tables gives a great listing of the various lookup and allocation schemes used in hashtables.
The GNU Scheme documentation has a nice discussion of how much space you can expect hashtables to take up, including a formal discussion of why "the amount of space used by the hash table is proportional to the number of associations in the table". This might interest you. Here are some things you might consider if you find you actually need to optimize your dictionary implementation: Here is the C source code for Python's dictionaries, in case you want ALL the details.
There's copious documentation in here: dictobject. H dicobject. C Here is a python implementation of that, in case you don't like reading C.(Thanks to Ben Peterson) The Java Hashtable class docs talk a bit about how load factors work, and how they affect the space your hash takes up.
Note there's a tradeoff between your load factor and how frequently you need to rehash. Rehashes can be costly.
Here's a more up-to-date version of the Python dictionary impl: code.python. Org/loggerhead/users/benjamin. Peterson/pydict/… – Benjamin Peterson Mar 22 '09 at 22:30 can you fix link to "python implementation" – Tshepang May 17 '10 at 17:39 It looks like it's gone -- I got it from Ben.
– tgamblin May 17 '10 at 18:05 can you remove that info then – Tshepang Aug 12 '10 at 21:11.
If you're using Python, you really shouldn't be worrying about this sort of thing in the first place. Just build your data structure the way it best suits your needs, not the computer's. This smacks of premature optimization, not performance improvement.
Profile your code if something is actually bottlenecking, but until then, just let Python do what it does and focus on the actual programming task, and not the underlying mechanics.
Simple" is generally better than "clever", especially if you have no tested reason to go beyond "simple". And anyway "Memory efficient" is an ambiguous term, and there are tradeoffs, when you consider persisting, serializing, cacheing, swapping, and a whole bunch of other stuff that someone else has already thought through so that in most cases you don't need to. Think "Simplest way to handle it properly" optimize much later.
Premature optimization bla bla, don't do it bla bla. I think you're mistaken about the power of two extra allocation does. I think its just a multiplier of two.
X*2, not x^2. I've seen this question a few times on various python mailing lists. With regards to memory, here's a paraphrased version of one such discussion (the post in question wanted to store hundreds of millions integers): A set() is more space efficient than a dict(), if you just want to test for membership gmpy has a bitvector type class for storing dense sets of integers Dicts are kept between 50% and 30% empty, and an entry is about ~12 bytes (though the true amount will vary by platform a bit).
So, the fewer objects you have, the less memory you're going to be using, and the fewer lookups you're going to do (since you'll have to lookup in the index, then a second lookup in the actual value). Like others, said, profile to see your bottlenecks. Keeping an membership set() and value dict() might be faster, but you'll be using more memory.
I'd also suggest reposting this to a python specific list, such as comp.lang. Python, which is full of much more knowledgeable people than myself who would give you all sorts of useful information.
If your dictionary is so big that it does not fit into memory, you might want to have a look at ZODB, a very mature object database for Python. The 'root' of the db has the same interface as a dictionary, and you don't need to load the whole data structure into memory at once e.g. You can iterate over only a portion of the structure by providing start and end keys.It also provides transactions and versioning.
Honestly, you won't be able to tell the difference either way, in terms of either performance or memory usage. Unless you're dealing with tens of millions of items or more, the performance or memory impact is just noise. From the way you worded your second sentence, it sounds like the one big dictionary is your first inclination, and matches more closely with the problem you're trying to solve.
If that's true, go with that. What you'll find about Python is that the solutions that everyone considers 'right' nearly always turn out to be those that are as clear and simple as possible.
Often times, dictionaries of dictionaries are useful for other than performance reasons. Ie, they allow you to store context information about the data without having extra fields on the objects themselves, and make querying subsets of the data faster. In terms of memory usage, it would stand to reason that one large dictionary will use less ram than multiple smaller ones.
Remember, if you're nesting dictionaries, each additional layer of nesting will roughly double the number of dictionaries you need to allocate. In terms of query speed, multiple dicts will take longer due to the increased number of lookups required. So I think the only way to answer this question is for you to profile your own code.
However, my suggestion is to use the method that makes your code the cleanest and easiest to maintain. Of all the features of Python, dictionaries are probably the most heavily tweaked for optimal performance.
Larger dicts obviously have longer lookup times than smaller ones": Wrong. Hashtables are avg. Case O(1) time.
– tgamblin Mar 22 '09 at 19:11 "it would stand to reason that one large dictionary will use less ram than multiple smaller ones": This is wrong too. Hashtables are O(n) space. There's no significant difference between the size of one large dictionary and multiple smaller ones.
See below. – tgamblin Mar 22 '09 at 19:47 @tgambin - as I said in the answer, the space efficiency is due to the creation of multiple dicts. OF COURSE there will be additional space required when when you're allocating more objects.
You're right about lookup speed, though. I modified the answer. – Daniel Naab Mar 22 '09 at 21:33 @Daniel: Unless you are allocating one dictionary per mapping or close to it, then the tiny overhead from each of those extra objects isn't going to matter.
What's INSIDE the dictionaries dominates, and he'll have the same number of mappings with one or multiple dictionaries. – tgamblin Mar 22 '09 at 22:36 @Daniel: "each additional layer of nesting will roughly double the number of dictionaries you need to allocate". If you are doing something like this, then you will end up with O(nlogn) space for n mappings, where log(n) comes from the extra dicts.
– tgamblin Mar 22 '09 at 22:55.
I'm writing an application in Python (2.6) that requires me to use a dictionary as a data store. I am curious as to whether or not it is more memory efficient to have one large dictionary, or to break that down into many (much) smaller dictionaries, then have an "index" dictionary that contains a reference to all the smaller dictionaries. I know there is a lot of overhead in general with lists and dictionaries.
I read somewhere that python internally allocates enough space that the dictionary/list # of items to the power of 2. I'm new enough to python that I'm not sure if there are other unexpected internal complexities/suprises like that, that is not apparent to the average user that I should take into consideration. One of the difficulties is knowing how the power of 2 system counts "items"?
I cant really gove you an answer,but what I can give you is a way to a solution, that is you have to find the anglde that you relate to or peaks your interest. A good paper is one that people get drawn into because it reaches them ln some way.As for me WW11 to me, I think of the holocaust and the effect it had on the survivors, their families and those who stood by and did nothing until it was too late.